Amy's Arxiv
Human-Computer Interaction
☆ A Sociotechnical Framework For Addressing Stigma and Designing Personalized Digital Health Products
Stigma, a recognized global barrier to effective disease management, impacts social interactions, resource access, and psychological well-being. In this study, we developed a patient-centered framework for deriving design requirements and interventions for health conditions subject to social stigma. This study introduces a patient-centered framework, grounded in sociotechnical systems theory, to create tailored interventions and design requirements for health conditions influenced by social stigma. We tested this framework through a mixed-method study on chronic pelvic pain patients. Our approach led to the identification of ten design requirements that encompass behavioral and psychological support and strategies for day-to-day living. The findings reveal a preference among CPP patients for priming and social support interventions. This study underscores the value of a systems-based perspective in healthcare, advocating for a nuanced, patient-centered approach that addresses the complex nature of health conditions affected by social stigma. It contributes to the ongoing discourse on integrating STS theory into healthcare frameworks, highlighting the need for targeted strategies to combat the complexities of stigma in patient care.
comment: 19 pages, 5 Tables, 1 Figure
☆ Towards Inclusive Video Commenting: Introducing Signmaku for the Deaf and Hard-of-Hearing CHI 2024
Previous research underscored the potential of danmaku--a text-based commenting feature on videos--in engaging hearing audiences. Yet, for many Deaf and hard-of-hearing (DHH) individuals, American Sign Language (ASL) takes precedence over English. To improve inclusivity, we introduce "Signmaku," a new commenting mechanism that uses ASL, serving as a sign language counterpart to danmaku. Through a need-finding study (N=12) and a within-subject experiment (N=20), we evaluated three design styles: real human faces, cartoon-like figures, and robotic representations. The results showed that cartoon-like signmaku not only entertained but also encouraged participants to create and share ASL comments, with fewer privacy concerns compared to the other designs. Conversely, the robotic representations faced challenges in accurately depicting hand movements and facial expressions, resulting in higher cognitive demands on users. Signmaku featuring real human faces elicited the lowest cognitive load and was the most comprehensible among all three types. Our findings offered novel design implications for leveraging generative AI to create signmaku comments, enriching co-learning experiences for DHH individuals.
comment: 14 pages, CHI 2024
☆ SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings CHI
Crafting effective captions for figures is important. Readers heavily depend on these captions to grasp the figure's message. However, despite a well-developed set of AI technologies for figures and captions, these have rarely been tested for usefulness in aiding caption writing. This paper introduces SciCapenter, an interactive system that puts together cutting-edge AI technologies for scientific figure captions to aid caption composition. SciCapenter generates a variety of captions for each figure in a scholarly article, providing scores and a comprehensive checklist to assess caption quality across multiple critical aspects, such as helpfulness, OCR mention, key takeaways, and visual properties reference. Users can directly edit captions in SciCapenter, resubmit for revised evaluations, and iteratively refine them. A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing. Participants' feedback further offers valuable design insights for future systems aiming to enhance caption writing.
comment: CHI EA '24: Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems
☆ Exploring the Boundaries of Ambient Awareness in Twitter
Ambient awareness refers to the ability of social media users to obtain knowledge about who knows what (i.e., users' expertise) in their network, by simply being exposed to other users' content (e.g, tweets on Twitter). Previous work, based on user surveys, reveals that individuals self-report ambient awareness only for parts of their networks. However, it is unclear whether it is their limited cognitive capacity or the limited exposure to diagnostic tweets (i.e., online content) that prevents people from developing ambient awareness for their complete network. In this work, we focus on in-wall ambient awareness (IWAA) in Twitter and conduct a two-step data-driven analysis, that allows us to explore to which extent IWAA is likely, or even possible. First, we rely on reactions (e.g., likes), as strong evidence of users being aware of experts in Twitter. Unfortunately, such strong evidence can be only measured for active users, which represent the minority in the network. Thus to study the boundaries of IWAA to a larger extent, in the second part of our analysis, we instead focus on the passive exposure to content generated by other users -- which we refer to as in-wall visibility. This analysis shows that (in line with \citet{levordashka2016ambient}) only for a subset of users IWAA is plausible, while for the majority it is unlikely, if even possible, to develop IWAA. We hope that our methodology paves the way for the emergence of data-driven approaches for the study of ambient awareness.
☆ FastPerson: Enhancing Video Learning through Effective Video Summarization that Preserves Linguistic and Visual Contexts
Quickly understanding lengthy lecture videos is essential for learners with limited time and interest in various topics to improve their learning efficiency. To this end, video summarization has been actively researched to enable users to view only important scenes from a video. However, these studies focus on either the visual or audio information of a video and extract important segments in the video. Therefore, there is a risk of missing important information when both the teacher's speech and visual information on the blackboard or slides are important, such as in a lecture video. To tackle this issue, we propose FastPerson, a video summarization approach that considers both the visual and auditory information in lecture videos. FastPerson creates summary videos by utilizing audio transcriptions along with on-screen images and text, minimizing the risk of overlooking crucial information for learners. Further, it provides a feature that allows learners to switch between the summary and original videos for each chapter of the video, enabling them to adjust the pace of learning based on their interests and level of understanding. We conducted an evaluation with 40 participants to assess the effectiveness of our method and confirmed that it reduced viewing time by 53\% at the same level of comprehension as that when using traditional video playback methods.
☆ Evaluating Authoring Tools with the Explorable Authoring Requirements SC
Explorables with interactive, multimodal content, openly available on the web, are a promising medium for education. Yet authoring such explorables requires web development expertise, excluding most educators and students from the authoring and remixing process. Some tools are available to reduce this barrier of entry and others are in development, making a method to evaluate these new tools necessary. On the basis of the software quality model ISO 25010, empirical results, and domain modeling, we derive the Explorable Authoring Requirements (EAR) as a requirements catalogue explorable authoring tools should implement. We then outline a future research design to operationalize EAR.
comment: 12 pages plus references, preprint of paper for NaKoDi2022 (adjunct to WiPSCE 2022 conference)
☆ Panonut360: A Head and Eye Tracking Dataset for Panoramic Video ACM MM
With the rapid development and widespread application of VR/AR technology, maximizing the quality of immersive panoramic video services that match users' personal preferences and habits has become a long-standing challenge. Understanding the saliency region where users focus, based on data collected with HMDs, can promote multimedia encoding, transmission, and quality assessment. At the same time, large-scale datasets are essential for researchers and developers to explore short/long-term user behavior patterns and train AI models related to panoramic videos. However, existing panoramic video datasets often include low-frequency user head or eye movement data through short-term videos only, lacking sufficient data for analyzing users' Field of View (FoV) and generating video saliency regions. Driven by these practical factors, in this paper, we present a head and eye tracking dataset involving 50 users (25 males and 25 females) watching 15 panoramic videos. The dataset provides details on the viewport and gaze attention locations of users. Besides, we present some statistics samples extracted from the dataset. For example, the deviation between head and eye movements challenges the widely held assumption that gaze attention decreases from the center of the FoV following a Gaussian distribution. Our analysis reveals a consistent downward offset in gaze fixations relative to the FoV in experimental settings involving multiple users and videos. That's why we name the dataset Panonut, a saliency weighting shaped like a donut. Finally, we also provide a script that generates saliency distributions based on given head or eye coordinates and pre-generated saliency distribution map sets of each video from the collected eye tracking data. The dataset is available on website: https://dianvrlab.github.io/Panonut360/.
comment: 7 pages,ACM MMSys'24 accepted
☆ ExpressEdit: Video Editing with Natural Language and Sketching
Informational videos serve as a crucial source for explaining conceptual and procedural knowledge to novices and experts alike. When producing informational videos, editors edit videos by overlaying text/images or trimming footage to enhance the video quality and make it more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors who often struggle with expressing and implementing their editing ideas. To address this challenge, we first explored how multimodality$-$natural language (NL) and sketching, which are natural modalities humans use for expression$-$can be utilized to support video editors in expressing video editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed the patterns of use of NL and sketching in describing edit intents. Based on the findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command and spatial references from sketching. The system implements the interpreted edits, which then the user can iterate on. An observational study (N=10) showed that ExpressEdit enhanced the ability of novice video editors to express and implement their edit ideas. The system allowed participants to perform edits more efficiently and generate more ideas by generating edits based on user's multimodal edit commands and supporting iterations on the editing commands. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.
comment: 22 pages, 5 figures, to be published in ACM IUI 2024
☆ Coimagining the Future of Voice Assistants with Cultural Sensitivity
Voice assistants (VAs) are becoming a feature of our everyday life. Yet, the user experience (UX) is often limited, leading to underuse, disengagement, and abandonment. Co-designing interactions for VAs with potential end-users can be useful. Crowdsourcing this process online and anonymously may add value. However, most work has been done in the English-speaking West on dialogue data sets. We must be sensitive to cultural differences in language, social interactions, and attitudes towards technology. Our aims were to explore the value of co-designing VAs in the non-Western context of Japan and demonstrate the necessity of cultural sensitivity. We conducted an online elicitation study (N = 135) where Americans (n = 64) and Japanese people (n = 71) imagined dialogues (N = 282) and activities (N = 73) with future VAs. We discuss the implications for coimagining interactions with future VAs, offer design guidelines for the Japanese and English-speaking US contexts, and suggest opportunities for cultural plurality in VA design and scholarship.
comment: 21 pages
☆ DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images
We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360{\deg} images based on a conditional score-based denoising diffusion model. Generating human gaze on 360{\deg} images is important for various human-computer interaction and computer graphics applications, e.g. for creating large-scale eye tracking datasets or for realistic animation of virtual humans. However, existing methods are limited to predicting discrete fixation sequences or aggregated saliency maps, thereby neglecting crucial parts of natural gaze behaviour. Our method uses features extracted from 360{\deg} images as condition and uses two transformers to model the temporal and spatial dependencies of continuous human gaze. We evaluate DiffGaze on two 360{\deg} image benchmarks for gaze sequence generation as well as scanpath prediction and saliency prediction. Our evaluations show that DiffGaze outperforms state-of-the-art methods on all tasks on both benchmarks. We also report a 21-participant user study showing that our method generates gaze sequences that are indistinguishable from real human sequences.
☆ Cognitively Biased Users Interacting with Algorithmically Biased Results in Whole-Session Search on Controversial Topics
When interacting with information retrieval (IR) systems, users, affected by confirmation biases, tend to select search results that confirm their existing beliefs on socially significant contentious issues. To understand the judgments and attitude changes of users searching online, our study examined how cognitively biased users interact with algorithmically biased search engine result pages (SERPs). We designed three-query search sessions on debated topics under various bias conditions. We recruited 1,321 crowdsourcing participants and explored their attitude changes, search interactions, and the effects of confirmation bias. Three key findings emerged: 1) most attitude changes occur in the initial query of a search session; 2) confirmation bias and result presentation on SERPs affect search behaviors in the current query and perceived familiarity with clicked results in subsequent queries. The bias position also affect attitude changes of users with lower perceived openness to conflicting opinions; 3) Interactions in the first query and and dwell time throughout the session are associated with users' attitude changes in different forms. Our study goes beyond traditional simulation-based evaluation settings and simulated rational users, sheds light on the mixed effects of human biases and algorithmic biases in controversial information retrieval tasks, and can inform the design of bias-aware user models, human-centered bias mitigation techniques, and socially responsible intelligent IR systems.
♻ ☆ As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli
As synthetic media becomes progressively more realistic and barriers to using it continue to lower, the technology has been increasingly utilized for malicious purposes, from financial fraud to nonconsensual pornography. Today, the principal defense against being misled by synthetic media relies on the ability of the human observer to visually and auditorily discern between real and fake. However, it remains unclear just how vulnerable people actually are to deceptive synthetic media in the course of their day to day lives. We conducted a perceptual study with 1276 participants to assess how accurate people were at distinguishing synthetic images, audio only, video only, and audiovisual stimuli from authentic. To reflect the circumstances under which people would likely encounter synthetic media in the wild, testing conditions and stimuli emulated a typical online platform, while all synthetic media used in the survey was sourced from publicly accessible generative AI technology. We find that overall, participants struggled to meaningfully discern between synthetic and authentic content. We also find that detection performance worsens when the stimuli contains synthetic content as compared to authentic content, images featuring human faces as compared to non face objects, a single modality as compared to multimodal stimuli, mixed authenticity as compared to being fully synthetic for audiovisual stimuli, and features foreign languages as compared to languages the observer is fluent in. Finally, we also find that prior knowledge of synthetic media does not meaningfully impact their detection performance. Collectively, these results indicate that people are highly susceptible to being tricked by synthetic media in their daily lives and that human perceptual detection capabilities can no longer be relied upon as an effective counterdefense.
comment: For study pre-registration, see https://osf.io/fnhr3
♻ ☆ App Planner: Utilizing Generative AI in K-12 Mobile App Development Education
App Planner is an interactive support tool for K-12 students, designed to assist in creating mobile applications. By utilizing generative AI, App Planner helps students articulate the problem and solution through guided conversations via a chat-based interface. It assists them in brainstorming and formulating new ideas for applications, provides feedback on those ideas, and stimulates creative thinking. Here we report usability tests from our preliminary study with high-school students who appreciated App Planner for aiding the app design process and providing new viewpoints on human aspects especially the potential negative impact of their creation.
comment: 7 pages, 1 figure, 1 table
♻ ☆ A Design Space for Intelligent and Interactive Writing Assistants CHI 2024
In our era of rapid technological advancement, the research landscape for writing assistants has become increasingly fragmented across various research communities. We seek to address this challenge by proposing a design space as a structured way to examine and explore the multidimensional space of intelligent and interactive writing assistants. Through a large community collaboration, we explore five aspects of writing assistants: task, user, technology, interaction, and ecosystem. Within each aspect, we define dimensions (i.e., fundamental components of an aspect) and codes (i.e., potential options for each dimension) by systematically reviewing 115 papers. Our design space aims to offer researchers and designers a practical tool to navigate, comprehend, and compare the various possibilities of writing assistants, and aid in the envisioning and design of new writing assistants.
comment: Published as a conference paper at CHI 2024
♻ ☆ A Critical Reflection on the Use of Toxicity Detection Algorithms in Proactive Content Moderation Systems
Toxicity detection algorithms, originally designed with reactive content moderation in mind, are increasingly being deployed into proactive end-user interventions to moderate content. Through a socio-technical lens and focusing on contexts in which they are applied, we explore the use of these algorithms in proactive moderation systems. Placing a toxicity detection algorithm in an imagined virtual mobile keyboard, we critically explore how such algorithms could be used to proactively reduce the sending of toxic content. We present findings from design workshops conducted with four distinct stakeholder groups and find concerns around how contextual complexities may exasperate inequalities around content moderation processes. Whilst only specific user groups are likely to directly benefit from these interventions, we highlight the potential for other groups to misuse them to circumvent detection, validate and gamify hate, and manipulate algorithmic models to exasperate harm.
♻ ☆ Learning User Embeddings from Human Gaze for Personalised Saliency Prediction
Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two public saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model's ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.
♻ ☆ Towards Massive Interaction with Generalist Robotics: A Systematic Review of XR-enabled Remote Human-Robot Interaction Systems
The rising interest of generalist robots seek to create robots with versatility to handle multiple tasks in a variety of environments, and human will interact with such robots through immersive interfaces. In the context of human-robot interaction (HRI), this survey provides an exhaustive review of the applications of extended reality (XR) technologies in the field of remote HRI. We developed a systematic search strategy based on the PRISMA methodology. From the initial 2,561 articles selected, 100 research papers that met our inclusion criteria were included. We categorized and summarized the domain in detail, delving into XR technologies, including augmented reality (AR), virtual reality (VR), and mixed reality (MR), and their applications in facilitating intuitive and effective remote control and interaction with robotic systems. The survey highlights existing articles on the application of XR technologies, user experience enhancement, and various interaction designs for XR in remote HRI, providing insights into current trends and future directions. We also identified potential gaps and opportunities for future research to improve remote HRI systems through XR technology to guide and inform future XR and robotics research.
♻ ☆ Data journeys in popular science: Producing climate change and COVID-19 data visualizations at Scientific American
Vast amounts of (open) data are increasingly used to make arguments about crisis topics such as climate change and global pandemics. Data visualizations are central to bringing these viewpoints to broader publics. However, visualizations often conceal the many contexts involved in their production, ranging from decisions made in research labs about collecting and sharing data to choices made in editorial rooms about which data stories to tell. In this paper, we examine how data visualizations about climate change and COVID-19 are produced in popular science magazines, using Scientific American, an established English-language popular science magazine, as a case study. To do this, we apply the analytical concept of data journeys (Leonelli, 2020) in a mixed methods study that centers on interviews with Scientific American staff and is supplemented by a visualization analysis of selected charts. In particular, we discuss the affordances of working with open data, the role of collaborative data practices, and how the magazine works to counter misinformation and increase transparency. This work provides an empirical contribution by providing insight into the data (visualization) practices of science communicators and demonstrating how the concept of data journeys can be used as an analytical framework.
comment: 44 pages, 4 figures, 3 boxes
♻ ☆ The opportunities and risks of large language models in mental health
Global rates of mental health concerns are rising and there is increasing realization that existing models of mental healthcare will not adequately expand to meet the demand. With the emergence of large language models (LLMs) has come great optimism regarding their promise to create novel, large-scale solutions to support mental health. Despite their nascence, LLMs have already been applied to mental health-related tasks. In this review, we summarize the extant literature on efforts to use LLMs to provide mental health education, assessment, and intervention and highlight key opportunities for positive impact in each area. We then highlight risks associated with LLMs application to mental health and encourage adoption of strategies to mitigate these risks. The urgent need for mental health support must be balanced with responsible development, testing, and deployment of mental health LLMs. Especially critical is ensuring that mental health LLMs are fine-tuned for mental health, enhance mental health equity, adhere to ethical standards, and that people, including those with lived experience with mental health concerns, are involved in all stages from development through deployment. Prioritizing these efforts will minimize potential harms to mental health and maximize the likelihood that LLMs will positively impact mental health globally.
comment: 12 pages, 2 tables, 4 figures
Social and Information Networks
☆ Exploring the Boundaries of Ambient Awareness in Twitter
Ambient awareness refers to the ability of social media users to obtain knowledge about who knows what (i.e., users' expertise) in their network, by simply being exposed to other users' content (e.g, tweets on Twitter). Previous work, based on user surveys, reveals that individuals self-report ambient awareness only for parts of their networks. However, it is unclear whether it is their limited cognitive capacity or the limited exposure to diagnostic tweets (i.e., online content) that prevents people from developing ambient awareness for their complete network. In this work, we focus on in-wall ambient awareness (IWAA) in Twitter and conduct a two-step data-driven analysis, that allows us to explore to which extent IWAA is likely, or even possible. First, we rely on reactions (e.g., likes), as strong evidence of users being aware of experts in Twitter. Unfortunately, such strong evidence can be only measured for active users, which represent the minority in the network. Thus to study the boundaries of IWAA to a larger extent, in the second part of our analysis, we instead focus on the passive exposure to content generated by other users -- which we refer to as in-wall visibility. This analysis shows that (in line with \citet{levordashka2016ambient}) only for a subset of users IWAA is plausible, while for the majority it is unlikely, if even possible, to develop IWAA. We hope that our methodology paves the way for the emergence of data-driven approaches for the study of ambient awareness.
☆ Distance-Based Hierarchical Cutting of Complex Networks with Non-Preferential and Preferential Choice of Seeds
Graphs and complex networks can be successively separated into connected components associated to respective seed nodes, therefore establishing a respective hierarchical organization. In the present work, we study the properties of the hierarchical structure implied by distance-based cutting of Erd\H{o}s-R\'enyi, Barab\'asi-Albert, and a specific geometric network. Two main situations are considered regarding the choice of the seeds: non-preferential and preferential to the respective node degree. Among the obtained findings, we have the tendency of geometrical networks yielding more balanced pairs of connected components along the network progressive separation, presenting little chaining effects, followed by the Erd\H{o}s-R\'enyi and Barab\'asi-Albert types of networks. The choice of seeds preferential to the node degree tended to enhance the balance of the connected components in the case of the geometrical networks.
comment: 15 pages and 9 figures
☆ Decoding excellence: Mapping the demand for psychological traits of operations and supply chain professionals through text mining
The current study proposes an innovative methodology for the profiling of psychological traits of Operations Management (OM) and Supply Chain Management (SCM) professionals. We use innovative methods and tools of text mining and social network analysis to map the demand for relevant skills from a set of job descriptions, with a focus on psychological characteristics. The proposed approach aims to evaluate the market demand for specific traits by combining relevant psychological constructs, text mining techniques, and an innovative measure, namely, the Semantic Brand Score. We apply the proposed methodology to a dataset of job descriptions for OM and SCM professionals, with the objective of providing a mapping of their relevant required skills, including psychological characteristics. In addition, the analysis is then detailed by considering the region of the organization that issues the job description, its organizational size, and the seniority level of the open position in order to understand their nuances. Finally, topic modeling is used to examine key components and their relative significance in job descriptions. By employing a novel methodology and considering contextual factors, we provide an innovative understanding of the attitudinal traits that differentiate professionals. This research contributes to talent management, recruitment practices, and professional development initiatives, since it provides new figures and perspectives to improve the effectiveness and success of Operations Management and Supply Chain Management professionals.
☆ $(ω_1, ω_2)$-Temporal random hyperbolic graphs
We extend a recent model of temporal random hyperbolic graphs by allowing connections and disconnections to persist across network snapshots with different probabilities, $\omega_1$ and $\omega_2$. This extension, while conceptually simple, poses analytical challenges involving the Appell $F_1$ series. Despite these challenges, we are able to analyze key properties of the model, which include the distributions of contact and intercontact durations, as well as the expected time-aggregated degree. The incorporation of $\omega_1$ and $\omega_2$ enables more flexible tuning of the average contact and intercontact durations, and of the average time-aggregated degree, providing a finer control for exploring the effect of temporal network dynamics on epidemic processes. Overall, our results provide new insights into the analysis of temporal networks and contribute to a more general representation of real-world scenarios.
☆ Learn from Heterophily: Heterophilous Information-enhanced Graph Neural Network
Under circumstances of heterophily, where nodes with different labels tend to be connected based on semantic meanings, Graph Neural Networks (GNNs) often exhibit suboptimal performance. Current studies on graph heterophily mainly focus on aggregation calibration or neighbor extension and address the heterophily issue by utilizing node features or structural information to improve GNN representations. In this paper, we propose and demonstrate that the valuable semantic information inherent in heterophily can be utilized effectively in graph learning by investigating the distribution of neighbors for each individual node within the graph. The theoretical analysis is carried out to demonstrate the efficacy of the idea in enhancing graph learning. Based on this analysis, we propose HiGNN, an innovative approach that constructs an additional new graph structure, that integrates heterophilous information by leveraging node distribution to enhance connectivity between nodes that share similar semantic characteristics. We conduct empirical assessments on node classification tasks using both homophilous and heterophilous benchmark datasets and compare HiGNN to popular GNN baselines and SoTA methods, confirming the effectiveness in improving graph representations. In addition, by incorporating heterophilous information, we demonstrate a notable enhancement in existing GNN-based approaches, and the homophily degree across real-world datasets, thus affirming the efficacy of our approach.
♻ ☆ LocalTweets to LocalHealth: A Mental Health Surveillance Framework Based on Twitter Data
Prior research on Twitter (now X) data has provided positive evidence of its utility in developing supplementary health surveillance systems. In this study, we present a new framework to surveil public health, focusing on mental health (MH) outcomes. We hypothesize that locally posted tweets are indicative of local MH outcomes and collect tweets posted from 765 neighborhoods (census block groups) in the USA. We pair these tweets from each neighborhood with the corresponding MH outcome reported by the Center for Disease Control (CDC) to create a benchmark dataset, LocalTweets. With LocalTweets, we present the first population-level evaluation task for Twitter-based MH surveillance systems. We then develop an efficient and effective method, LocalHealth, for predicting MH outcomes based on LocalTweets. When used with GPT3.5, LocalHealth achieves the highest F1-score and accuracy of 0.7429 and 79.78\%, respectively, a 59\% improvement in F1-score over the GPT3.5 in zero-shot setting. We also utilize LocalHealth to extrapolate CDC's estimates to proxy unreported neighborhoods, achieving an F1-score of 0.7291. Our work suggests that Twitter data can be effectively leveraged to simulate neighborhood-level MH outcomes.
♻ ☆ A Decade of Scholarly Research on Open Knowledge Graphs LREC
The proliferation of open knowledge graphs has led to a surge in scholarly research on the topic over the past decade. This paper presents a bibliometric analysis of the scholarly literature on open knowledge graphs published between 2013 and 2023. The study aims to identify the trends, patterns, and impact of research in this field, as well as the key topics and research questions that have emerged. The work uses bibliometric techniques to analyze a sample of 4445 scholarly articles retrieved from Scopus. The findings reveal an ever-increasing number of publications on open knowledge graphs published every year, particularly in developed countries (+50 per year). These outputs are published in highly-referred scholarly journals and conferences. The study identifies three main research themes: (1) knowledge graph construction and enrichment, (2) evaluation and reuse, and (3) fusion of knowledge graphs into NLP systems. Within these themes, the study identifies specific tasks that have received considerable attention, including entity linking, knowledge graph embedding, and graph neural networks.
comment: Camera-ready edition for LREC-COLING 2024
♻ ☆ Ellipsoidal embeddings of graphs
Due to their flexibility to represent almost any kind of relational data, graph-based models have enjoyed a tremendous success over the past decades. While graphs are inherently only combinatorial objects, however, many prominent analysis tools are based on the algebraic representation of graphs via matrices such as the graph Laplacian, or on associated graph embeddings. Such embeddings associate to each node a set of coordinates in a vector space, a representation which can then be employed for learning tasks such as the classification or alignment of the nodes of the graph. As the geometric picture provided by embedding methods enables the use of a multitude of methods developed for vector space data, embeddings have thus gained interest both from a theoretical as well as a practical perspective. Inspired by trace-optimization problems, often encountered in the analysis of graph-based data, here we present a method to derive ellipsoidal embeddings of the nodes of a graph, in which each node is assigned a set of coordinates on the surface of a hyperellipsoid. Our method may be seen as an alternative to popular spectral embedding techniques, to which it shares certain similarities we discuss. To illustrate the utility of the embedding we conduct a case study in which we analyse synthetic and real world networks with modular structure, and compare the results obtained with known methods in the literature.
comment: 29 pages, 6 figures. A few typos corrected
♻ ☆ Graph Generation with $K^2$-trees ICLR
Generating graphs from a target distribution is a significant challenge across many domains, including drug discovery and social network analysis. In this work, we introduce a novel graph generation method leveraging $K^2$-tree representation, originally designed for lossless graph compression. The $K^2$-tree representation {encompasses inherent hierarchy while enabling compact graph generation}. In addition, we make contributions by (1) presenting a sequential $K^2$-treerepresentation that incorporates pruning, flattening, and tokenization processes and (2) introducing a Transformer-based architecture designed to generate the sequence by incorporating a specialized tree positional encoding scheme. Finally, we extensively evaluate our algorithm on four general and two molecular graph datasets to confirm its superiority for graph generation.
comment: International Conference on Learning Representations (ICLR) 2024
♻ ☆ Large Language Models in Biomedical and Health Informatics: A Bibliometric Review
Large Language Models (LLMs) have rapidly become important tools in Biomedical and Health Informatics (BHI), enabling new ways to analyze data, treat patients, and conduct research. This bibliometric review aims to provide a panoramic view of how LLMs have been used in BHI by examining research articles and collaboration networks from 2022 to 2023. It further explores how LLMs can improve Natural Language Processing (NLP) applications in various BHI areas like medical diagnosis, patient engagement, electronic health record management, and personalized medicine. To do this, our bibliometric review identifies key trends, maps out research networks, and highlights major developments in this fast-moving field. Lastly, it discusses the ethical concerns and practical challenges of using LLMs in BHI, such as data privacy and reliable medical recommendations. Looking ahead, we consider how LLMs could further transform biomedical research as well as healthcare delivery and patient outcomes. This bibliometric review serves as a resource for stakeholders in healthcare, including researchers, clinicians, and policymakers, to understand the current state and future potential of LLMs in BHI.
comment: 50 pages, 7 figures, 4 tables
♻ ☆ Opinion Market Model: Stemming Far-Right Opinion Spread using Positive Interventions AAAI
Online extremism has severe societal consequences, including normalizing hate speech, user radicalization, and increased social divisions. Various mitigation strategies have been explored to address these consequences. One such strategy uses positive interventions: controlled signals that add attention to the opinion ecosystem to boost certain opinions. To evaluate the effectiveness of positive interventions, we introduce the Opinion Market Model (OMM), a two-tier online opinion ecosystem model that considers both inter-opinion interactions and the role of positive interventions. The size of the opinion attention market is modeled in the first tier using the multivariate discrete-time Hawkes process; in the second tier, opinions cooperate and compete for market share, given limited attention using the market share attraction model. We demonstrate the convergence of our proposed estimation scheme on a synthetic dataset. Next, we test OMM on two learning tasks, applying to two real-world datasets to predict attention market shares and uncover latent relationships between online items. The first dataset comprises Facebook and Twitter discussions containing moderate and far-right opinions about bushfires and climate change. The second dataset captures popular VEVO artists' YouTube and Twitter attention volumes. OMM outperforms the state-of-the-art predictive models on both datasets and captures latent cooperation-competition relations. We uncover (1) self- and cross-reinforcement between far-right and moderate opinions on the bushfires and (2) pairwise artist relations that correlate with real-world interactions such as collaborations and long-lasting feuds. Lastly, we use OMM as a testbed for positive interventions and show how media coverage modulates the spread of far-right opinions.
comment: accepted in the 18th AAAI International Conference on Web and Social Media (ICWSM'24)
Distributed, Parallel, and Cluster Computing
☆ Empowering Data Mesh with Federated Learning KDD
The evolution of data architecture has seen the rise of data lakes, aiming to solve the bottlenecks of data management and promote intelligent decision-making. However, this centralized architecture is limited by the proliferation of data sources and the growing demand for timely analysis and processing. A new data paradigm, Data Mesh, is proposed to overcome these challenges. Data Mesh treats domains as a first-class concern by distributing the data ownership from the central team to each data domain, while keeping the federated governance to monitor domains and their data products. Many multi-million dollar organizations like Paypal, Netflix, and Zalando have already transformed their data analysis pipelines based on this new architecture. In this decentralized architecture where data is locally preserved by each domain team, traditional centralized machine learning is incapable of conducting effective analysis across multiple domains, especially for security-sensitive organizations. To this end, we introduce a pioneering approach that incorporates Federated Learning into Data Mesh. To the best of our knowledge, this is the first open-source applied work that represents a critical advancement toward the integration of federated learning methods into the Data Mesh paradigm, underscoring the promising prospects for privacy-preserving and decentralized data analysis strategies within Data Mesh architecture.
comment: In Proceedings of ACM Knowledge Discovery and Data Mining, Barcelona, Spain, 25th - 29th August, 2024 (Conference acronym KDD), 9 pages
☆ An AI-Native Runtime for Multi-Wearable Environments
The miniaturization of AI accelerators is paving the way for next-generation wearable applications within wearable technologies. We introduce Mojito, an AI-native runtime with advanced MLOps designed to facilitate the development and deployment of these applications on wearable devices. It emphasizes the necessity of dynamic orchestration of distributed resources equipped with ultra-low-power AI accelerators to overcome challenges associated with unpredictable runtime environments. Through its innovative approaches, Mojito demonstrates how future wearable technologies can evolve to be more autonomous.
comment: 7 pages, 4 figures
☆ GPFL: A Gradient Projection-Based Client Selection Framework for Efficient Federated Learning
Federated learning client selection is crucial for determining participant clients while balancing model accuracy and communication efficiency. Existing methods have limitations in handling data heterogeneity, computational burdens, and independent client treatment. To address these challenges, we propose GPFL, which measures client value by comparing local and global descent directions. We also employ an Exploit-Explore mechanism to enhance performance. Experimental results on FEMINST and CIFAR-10 datasets demonstrate that GPFL outperforms baselines in Non-IID scenarios, achieving over 9\% improvement in FEMINST test accuracy. Moreover, GPFL exhibits shorter computation times through pre-selection and parameter reuse in federated learning.
comment: 8 pages, 5 figures
☆ SPES: Towards Optimizing Performance-Resource Trade-Off for Serverless Functions ICDE 2024
As an emerging cloud computing deployment paradigm, serverless computing is gaining traction due to its efficiency and ability to harness on-demand cloud resources. However, a significant hurdle remains in the form of the cold start problem, causing latency when launching new function instances from scratch. Existing solutions tend to use over-simplistic strategies for function pre-loading/unloading without full invocation pattern exploitation, rendering unsatisfactory optimization of the trade-off between cold start latency and resource waste. To bridge this gap, we propose SPES, the first differentiated scheduler for runtime cold start mitigation by optimizing serverless function provision. Our insight is that the common architecture of serverless systems prompts the con- centration of certain invocation patterns, leading to predictable invocation behaviors. This allows us to categorize functions and pre-load/unload proper function instances with finer-grained strategies based on accurate invocation prediction. Experiments demonstrate the success of SPES in optimizing serverless function provision on both sides: reducing the 75th-percentile cold start rates by 49.77% and the wasted memory time by 56.43%, compared to the state-of-the-art. By mitigating the cold start issue, SPES is a promising advancement in facilitating cloud services deployed on serverless architectures.
comment: 12 pages, accepted by ICDE 2024 (40th IEEE International Conference on Data Engineering)
☆ A Survey on Resource Management in Joint Communication and Computing-Embedded SAGIN
The advent of the 6G era aims for ubiquitous connectivity, with the integration of non-terrestrial networks (NTN) offering extensive coverage and enhanced capacity. As manufacturing advances and user demands evolve, space-air-ground integrated networks (SAGIN) with computational capabilities emerge as a viable solution for services requiring low latency and high computational power. Resource management within joint communication and computing-embedded SAGIN (JCC-SAGIN) presents greater complexity than traditional terrestrial networks. This complexity arises from the spatiotemporal dynamics of network topology and service demand, the interdependency of large-scale resource variables, and intricate tradeoffs among various performance metrics. Thus, a thorough examination of resource management strategies in JCC-SAGIN is crucial, emphasizing the role of non-terrestrial platforms with processing capabilities in 6G. This paper begins by reviewing the architecture, enabling technologies, and applications in JCC-SAGIN. Then, we offer a detailed overview of resource management modeling and optimization methods, encompassing both traditional optimization approaches and learning-based intelligent decision-making frameworks. Finally, we outline the prospective research directions in JCC-SAGIN.
comment: 43 pages, 17 figures
☆ FedMIL: Federated-Multiple Instance Learning for Video Analysis with Optimized DPP Scheduling
Many AI platforms, including traffic monitoring systems, use Federated Learning (FL) for decentralized sensor data processing for learning-based applications while preserving privacy and ensuring secured information transfer. On the other hand, applying supervised learning to large data samples, like high-resolution images requires intensive human labor to label different parts of a data sample. Multiple Instance Learning (MIL) alleviates this challenge by operating over labels assigned to the 'bag' of instances. In this paper, we introduce Federated Multiple-Instance Learning (FedMIL). This framework applies federated learning to boost the training performance in video-based MIL tasks such as vehicle accident detection using distributed CCTV networks. However, data sources in decentralized settings are not typically Independently and Identically Distributed (IID), making client selection imperative to collectively represent the entire dataset with minimal clients. To address this challenge, we propose DPPQ, a framework based on the Determinantal Point Process (DPP) with a quality-based kernel to select clients with the most diverse datasets that achieve better performance compared to both random selection and current DPP-based client selection methods even with less data utilization in the majority of non-IID cases. This offers a significant advantage for deployment on edge devices with limited computational resources, providing a reliable solution for training AI models in massive smart sensor networks.
☆ Not All Federated Learning Algorithms Are Created Equal: A Performance Evaluation Study
Federated Learning (FL) emerged as a practical approach to training a model from decentralized data. The proliferation of FL led to the development of numerous FL algorithms and mechanisms. Many prior efforts have given their primary focus on accuracy of those approaches, but there exists little understanding of other aspects such as computational overheads, performance and training stability, etc. To bridge this gap, we conduct extensive performance evaluation on several canonical FL algorithms (FedAvg, FedProx, FedYogi, FedAdam, SCAFFOLD, and FedDyn) by leveraging an open-source federated learning framework called Flame. Our comprehensive measurement study reveals that no single algorithm works best across different performance metrics. A few key observations are: (1) While some state-of-the-art algorithms achieve higher accuracy than others, they incur either higher computation overheads (FedDyn) or communication overheads (SCAFFOLD). (2) Recent algorithms present smaller standard deviation in accuracy across clients than FedAvg, indicating that the advanced algorithms' performances are stable. (3) However, algorithms such as FedDyn and SCAFFOLD are more prone to catastrophic failures without the support of additional techniques such as gradient clipping. We hope that our empirical study can help the community to build best practices in evaluating FL algorithms.
♻ ☆ Activations and Gradients Compression for Model-Parallel Training
Large neural networks require enormous computational clusters of machines. Model-parallel training, when the model architecture is partitioned sequentially between workers, is a popular approach for training modern models. Information compression can be applied to decrease workers communication time, as it is often a bottleneck in such systems. This work explores how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and also experiment with error compensation techniques. Moreover, we employ TopK with AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression rates than activations. We observe that $K=10\%$ is the lowest TopK compression level, which does not harm model convergence severely. Experiments also show that models trained with TopK perform well only when compression is also applied during inference. We find that error feedback techniques do not improve model-parallel training compared to plain compression, but allow model inference without compression with almost no quality drop. Finally, when applied with the AQ-SGD approach, TopK stronger than with $ K=30\%$ worsens model performance significantly.
comment: 17 pages, 6 figures, 5 tables
♻ ☆ A simple GPU implementation of spectral-element methods for solving 3D Poisson type equations on rectangular domains and its applications
It is well known since 1960s that by exploring the tensor product structure of the discrete Laplacian on Cartesian meshes, one can develop a simple direct Poisson solver with an $\mathcal O(N^{\frac{d+1}d})$ complexity in d-dimension, where N is the number of the total unknowns. The GPU acceleration of numerically solving PDEs has been explored successfully around fifteen years ago and become more and more popular in the past decade, driven by significant advancement in both hardware and software technologies, especially in the recent few years. We present in this paper a simple but extremely fast MATLAB implementation on a modern GPU, which can be easily reproduced, for solving 3D Poisson type equations using a spectral-element method. In particular, it costs less than one second on a Nvidia A100 for solving a Poisson equation with one billion degree of freedoms. We also present applications of this fast solver to solve a linear (time-independent) Schr\"odinger equation and a nonlinear (time-dependent) Cahn-Hilliard equation.
♻ ☆ Parallel Self-Avoiding Walks for a Low-Autocorrelation Binary Sequences Problem
A low-autocorrelation binary sequences problem with a high figure of merit factor represents a formidable computational challenge. An efficient parallel computing algorithm is required to reach the new best-known solutions for this problem. Therefore, we developed the $\mathit{sokol}_{\mathit{skew}}$ solver for the skew-symmetric search space. The developed solver takes the advantage of parallel computing on graphics processing units. The solver organized the search process as a sequence of parallel and contiguous self-avoiding walks and achieved a speedup factor of 387 compared with $\mathit{lssOrel}$, its predecessor. The $\mathit{sokol}_{\mathit{skew}}$ solver belongs to stochastic solvers and can not guarantee the optimality of solutions. To mitigate this problem, we established the predictive model of stopping conditions according to the small instances for which the optimal skew-symmetric solutions are known. With its help and 99% probability, the $\mathit{sokol}_{\mathit{skew}}$ solver found all the known and seven new best-known skew-symmetric sequences for odd instances from $L=121$ to $L=223$. For larger instances, the solver can not reach 99% probability within our limitations, but it still found several new best-known binary sequences. We also analyzed the trend of the best merit factor values, and it shows that as sequence size increases, the value of the merit factor also increases, and this trend is flatter for larger instances.
comment: 17 pages, 5 figures
♻ ☆ DISL: Fueling Research with A Large Dataset of Solidity Smart Contracts
The DISL dataset features a collection of $514,506$ unique Solidity files that have been deployed to Ethereum mainnet. It caters to the need for a large and diverse dataset of real-world smart contracts. DISL serves as a resource for developing machine learning systems and for benchmarking software engineering tools designed for smart contracts. By aggregating every verified smart contract from Etherscan up to January 15, 2024, DISL surpasses existing datasets in size and recency.
♻ ☆ P-TimeSync: A Precise Time Synchronization Simulation with Network Propagation Delays
Time serves as the foundation of modern society and will continue to grow in value in the future world. Unlike previous research papers, authors delve into various time sources, ranging from atomic time and GPS time to quartz time. Specifically, we explore the time uncertainty associated with the four major Global Navigation Satellite Systems. In existing time synchronization simulations provide partial usages. However, our research introduces a comprehensive and precise time synchronization simulation named P-TimeSync, leading to a better understanding of time synchronization in distributed environments. It is a state-of-the-art simulation tool for time because (1) it can simulate atomic clocks and quartz clocks with user-defined software clock algorithms, (2) the simulation provides nanosecond-level precision time across different network propagation paths and distances, (3) the tool offers a visualization platform with classic algorithms for distributed time synchronization, such as Cristian's algorithm and Berkeley algorithm. The simulation easily allows for the redefinition of configurations and functions, supporting advanced research and development. The simulation tool could be downloaded via the website: https://github.com/rui5097/purdue_timesync
comment: 6 pages, 5 figures, conference paper
♻ ☆ Securing Blockchain Systems: A Novel Collaborative Learning Framework to Detect Attacks in Transactions and Smart Contracts
With the escalating prevalence of malicious activities exploiting vulnerabilities in blockchain systems, there is an urgent requirement for robust attack detection mechanisms. To address this challenge, this paper presents a novel collaborative learning framework designed to detect attacks in blockchain transactions and smart contracts by analyzing transaction features. Our framework exhibits the capability to classify various types of blockchain attacks, including intricate attacks at the machine code level (e.g., injecting malicious codes to withdraw coins from users unlawfully), which typically necessitate significant time and security expertise to detect. To achieve that, the proposed framework incorporates a unique tool that transforms transaction features into visual representations, facilitating efficient analysis and classification of low-level machine codes. Furthermore, we propose a customized collaborative learning model to enable real-time detection of diverse attack types at distributed mining nodes. In order to create a comprehensive dataset, we deploy a pilot system based on a private Ethereum network and conduct multiple attack scenarios. To the best of our knowledge, our dataset is the most comprehensive and diverse collection of transactions and smart contracts synthesized in a laboratory for cyberattack detection in blockchain systems. Our framework achieves a detection accuracy of approximately 94\% through extensive simulations and real-time experiments with a throughput of over 2,150 transactions per second. These compelling results validate the efficacy of our framework and showcase its adaptability in addressing real-world cyberattack scenarios.
Machine Learning
☆ SLEDGE: Synthesizing Simulation Environments for Driving Agents with Generative Models
SLEDGE is the first generative simulator for vehicle motion planning trained on real-world driving logs. Its core component is a learned model that is able to generate agent bounding boxes and lane graphs. The model's outputs serve as an initial state for traffic simulation. The unique properties of the entities to be generated for SLEDGE, such as their connectivity and variable count per scene, render the naive application of most modern generative models to this task non-trivial. Therefore, together with a systematic study of existing lane graph representations, we introduce a novel raster-to-vector autoencoder (RVAE). It encodes agents and the lane graph into distinct channels in a rasterized latent map. This facilitates both lane-conditioned agent generation and combined generation of lanes and agents with a Diffusion Transformer. Using generated entities in SLEDGE enables greater control over the simulation, e.g. upsampling turns or increasing traffic density. Further, SLEDGE can support 500m long routes, a capability not found in existing data-driven simulators like nuPlan. It presents new challenges for planning algorithms, evidenced by failure rates of over 40% for PDM, the winner of the 2023 nuPlan challenge, when tested on hard routes and dense traffic generated by our model. Compared to nuPlan, SLEDGE requires 500$\times$ less storage to set up (<4GB), making it a more accessible option and helping with democratizing future research in this field.
☆ The Need for Speed: Pruning Transformers with One Recipe ICLR
We introduce the $\textbf{O}$ne-shot $\textbf{P}$runing $\textbf{T}$echnique for $\textbf{I}$nterchangeable $\textbf{N}$etworks ($\textbf{OPTIN}$) framework as a tool to increase the efficiency of pre-trained transformer architectures $\textit{without requiring re-training}$. Recent works have explored improving transformer efficiency, however often incur computationally expensive re-training procedures or depend on architecture-specific characteristics, thus impeding practical wide-scale adoption. To address these shortcomings, the OPTIN framework leverages intermediate feature distillation, capturing the long-range dependencies of model parameters (coined $\textit{trajectory}$), to produce state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks $\textit{without re-training}$. Given a FLOP constraint, the OPTIN framework will compress the network while maintaining competitive accuracy performance and improved throughput. Particularly, we show a $\leq 2$% accuracy degradation from NLP baselines and a $0.5$% improvement from state-of-the-art methods on image classification at competitive FLOPs reductions. We further demonstrate the generalization of tasks and architecture with comparative performance using Mask2Former for semantic segmentation and cnn-style networks. OPTIN presents one of the first one-shot efficient frameworks for compressing transformer architectures that generalizes well across different class domains, in particular: natural language and image-related tasks, without $\textit{re-training}$.
comment: Accepted in the International Conference on Learning Representations (ICLR) 2024
☆ LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
The machine learning community has witnessed impressive advancements since the first appearance of large language models (LLMs), yet their huge memory consumption has become a major roadblock to large-scale training. Parameter Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem, but their performance still fails to match full parameter training in most large-scale fine-tuning settings. Attempting to complement this deficiency, we investigate layerwise properties of LoRA on fine-tuning tasks and observe an uncommon skewness of weight norms across different layers. Utilizing this key observation, a surprisingly simple training strategy is discovered, which outperforms both LoRA and full parameter training in a wide range of settings with memory costs as low as LoRA. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative for LoRA, which applies the idea of importance sampling to different layers in LLMs and randomly freeze most middle layers during optimization. Experimental results show that with similar or less GPU memory consumption, LISA surpasses LoRA or even full parameter tuning in downstream fine-tuning tasks, where LISA consistently outperforms LoRA by over $11\%$-$37\%$ in terms of MT-Bench scores. On large models, specifically LLaMA-2-70B, LISA achieves on-par or better performance than LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.
☆ CMP: Cooperative Motion Prediction with Multi-Agent Communication
The confluence of the advancement of Autonomous Vehicles (AVs) and the maturity of Vehicle-to-Everything (V2X) communication has enabled the capability of cooperative connected and automated vehicles (CAVs). Building on top of cooperative perception, this paper explores the feasibility and effectiveness of cooperative motion prediction. Our method, CMP, takes LiDAR signals as input to enhance tracking and prediction capabilities. Unlike previous work that focuses separately on either cooperative perception or motion prediction, our framework, to the best of our knowledge, is the first to address the unified problem where CAVs share information in both perception and prediction modules. Incorporated into our design is the unique capability to tolerate realistic V2X bandwidth limitations and transmission delays, while dealing with bulky perception representations. We also propose a prediction aggregation module, which unifies the predictions obtained by different CAVs and generates the final prediction. Through extensive experiments and ablation studies, we demonstrate the effectiveness of our method in cooperative perception, tracking, and motion prediction tasks. In particular, CMP reduces the average prediction error by 17.2\% with fewer missing detections compared with the no cooperation setting. Our work marks a significant step forward in the cooperative capabilities of CAVs, showcasing enhanced performance in complex scenarios.
☆ Scalable Non-Cartesian Magnetic Resonance Imaging with R2D2
We propose a new approach for non-Cartesian magnetic resonance image reconstruction. While unrolled architectures provide robustness via data-consistency layers, embedding measurement operators in Deep Neural Network (DNN) can become impractical at large scale. Alternative Plug-and-Play (PnP) approaches, where the denoising DNNs are blind to the measurement setting, are not affected by this limitation and have also proven effective, but their highly iterative nature also affects scalability. To address this scalability challenge, we leverage the "Residual-to-Residual DNN series for high-Dynamic range imaging (R2D2)" approach recently introduced in astronomical imaging. R2D2's reconstruction is formed as a series of residual images, iteratively estimated as outputs of DNNs taking the previous iteration's image estimate and associated data residual as inputs. The method can be interpreted as a learned version of the Matching Pursuit algorithm. We demonstrate R2D2 in simulation, considering radial k-space sampling acquisition sequences. Our preliminary results suggest that R2D2 achieves: (i) suboptimal performance compared to its unrolled incarnation R2D2-Net, which is however non-scalable due to the necessary embedding of NUFFT-based data-consistency layers; (ii) superior reconstruction quality to a scalable version of R2D2-Net embedding an FFT-based approximation for data consistency; (iii) superior reconstruction quality to PnP, while only requiring few iterations.
comment: submitted to IEEE EUSIPCO 2024
☆ Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models
The landscape of computational building blocks of efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters are inherently local and therefore struggle at modeling long-range dependencies in images. On the other hand, attention excels at capturing global interactions between arbitrary image regions, however at a quadratic cost in image dimension. In this work, we propose Serpent, an architecture that leverages recent advances in state space models (SSMs) in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with a favorable linear scaling in input size. Our preliminary results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques, while requiring orders of magnitude less compute (up to $150$ fold reduction in FLOPS) and a factor of up to $5\times$ less GPU memory while maintaining a compact model size.
comment: 7 pages, 5 figures, preliminary workshop submission of a comprehensive work to be released soon
☆ Image-based Novel Fault Detection with Deep Learning Classifiers using Hierarchical Labels
One important characteristic of modern fault classification systems is the ability to flag the system when faced with previously unseen fault types. This work considers the unknown fault detection capabilities of deep neural network-based fault classifiers. Specifically, we propose a methodology on how, when available, labels regarding the fault taxonomy can be used to increase unknown fault detection performance without sacrificing model performance. To achieve this, we propose to utilize soft label techniques to improve the state-of-the-art deep novel fault detection techniques during the training process and novel hierarchically consistent detection statistics for online novel fault detection. Finally, we demonstrated increased detection performance on novel fault detection in inspection images from the hot steel rolling process, with results well replicated across multiple scenarios and baseline detection methods.
comment: Accepted in IISE Transaction
☆ Large scale paired antibody language models
Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best performing antibody-specific language models developed to date which can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively using the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large scale data sets and high-performance computing for enhancing antibody design for therapeutic development.
comment: 14 pages, 2 figures, 6 tables, model weights available at https://zenodo.org/doi/10.5281/zenodo.10876908
☆ The Unreasonable Ineffectiveness of the Deeper Layers
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
comment: 12 + 10 pages, 5 + 4 figures
☆ Compressed Multi-task embeddings for Data-Efficient Downstream training and inference in Earth Observation
As repositories of large scale data in earth observation (EO) have grown, so have transfer and storage costs for model training and inference, expending significant resources. We introduce Neural Embedding Compression (NEC), based on the transfer of compressed embeddings to data consumers instead of raw data. We adapt foundation models (FM) through learned neural compression to generate multi-task embeddings while navigating the tradeoff between compression rate and embedding utility. We update only a small fraction of the FM parameters (10%) for a short training period (1% of the iterations of pre-training). We evaluate NEC on two EO tasks: scene classification and semantic segmentation. Compared with applying traditional compression to the raw data, NEC achieves similar accuracy with a 75% to 90% reduction in data. Even at 99.7% compression, performance drops by only 5% on the scene classification task. Overall, NEC is a data-efficient yet performant approach for multi-task EO modelling.
comment: Published at IGARSS 2024
☆ Empowering Data Mesh with Federated Learning KDD
The evolution of data architecture has seen the rise of data lakes, aiming to solve the bottlenecks of data management and promote intelligent decision-making. However, this centralized architecture is limited by the proliferation of data sources and the growing demand for timely analysis and processing. A new data paradigm, Data Mesh, is proposed to overcome these challenges. Data Mesh treats domains as a first-class concern by distributing the data ownership from the central team to each data domain, while keeping the federated governance to monitor domains and their data products. Many multi-million dollar organizations like Paypal, Netflix, and Zalando have already transformed their data analysis pipelines based on this new architecture. In this decentralized architecture where data is locally preserved by each domain team, traditional centralized machine learning is incapable of conducting effective analysis across multiple domains, especially for security-sensitive organizations. To this end, we introduce a pioneering approach that incorporates Federated Learning into Data Mesh. To the best of our knowledge, this is the first open-source applied work that represents a critical advancement toward the integration of federated learning methods into the Data Mesh paradigm, underscoring the promising prospects for privacy-preserving and decentralized data analysis strategies within Data Mesh architecture.
comment: In Proceedings of ACM Knowledge Discovery and Data Mining, Barcelona, Spain, 25th - 29th August, 2024 (Conference acronym KDD), 9 pages
☆ Sample complexity of quantum hypothesis testing
Quantum hypothesis testing has been traditionally studied from the information-theoretic perspective, wherein one is interested in the optimal decay rate of error probabilities as a function of the number of samples of an unknown state. In this paper, we study the sample complexity of quantum hypothesis testing, wherein the goal is to determine the minimum number of samples needed to reach a desired error probability. By making use of the wealth of knowledge that already exists in the literature on quantum hypothesis testing, we characterize the sample complexity of binary quantum hypothesis testing in the symmetric and asymmetric settings, and we provide bounds on the sample complexity of multiple quantum hypothesis testing. In more detail, we prove that the sample complexity of symmetric binary quantum hypothesis testing depends logarithmically on the inverse error probability and inversely on the negative logarithm of the fidelity. As a counterpart of the quantum Stein's lemma, we also find that the sample complexity of asymmetric binary quantum hypothesis testing depends logarithmically on the inverse type~II error probability and inversely on the quantum relative entropy. Finally, we provide lower and upper bounds on the sample complexity of multiple quantum hypothesis testing, with it remaining an intriguing open question to improve these bounds.
comment: 38 pages, 1 figure, preliminary version; see independent and concurrent work of Pensia, Jog, Loh at arXiv:2403.16981
☆ Using Domain Knowledge to Guide Dialog Structure Induction via Neural Probabilistic Soft Logic
Dialog Structure Induction (DSI) is the task of inferring the latent dialog structure (i.e., a set of dialog states and their temporal transitions) of a given goal-oriented dialog. It is a critical component for modern dialog system design and discourse analysis. Existing DSI approaches are often purely data-driven, deploy models that infer latent states without access to domain knowledge, underperform when the training corpus is limited/noisy, or have difficulty when test dialogs exhibit distributional shifts from the training domain. This work explores a neural-symbolic approach as a potential solution to these problems. We introduce Neural Probabilistic Soft Logic Dialogue Structure Induction (NEUPSL DSI), a principled approach that injects symbolic knowledge into the latent space of a generative neural model. We conduct a thorough empirical investigation on the effect of NEUPSL DSI learning on hidden representation quality, few-shot learning, and out-of-domain generalization performance. Over three dialog structure induction datasets and across unsupervised and semi-supervised settings for standard and cross-domain generalization, the injection of symbolic knowledge using NEUPSL DSI provides a consistent boost in performance over the canonical baselines.
☆ Counterfactual Fairness through Transforming Data Orthogonal to Bias
Machine learning models have shown exceptional prowess in solving complex issues across various domains. Nonetheless, these models can sometimes exhibit biased decision-making, leading to disparities in treatment across different groups. Despite the extensive research on fairness, the nuanced effects of multivariate and continuous sensitive variables on decision-making outcomes remain insufficiently studied. We introduce a novel data pre-processing algorithm, Orthogonal to Bias (OB), designed to remove the influence of a group of continuous sensitive variables, thereby facilitating counterfactual fairness in machine learning applications. Our approach is grounded in the assumption of a jointly normal distribution within a structural causal model (SCM), proving that counterfactual fairness can be achieved by ensuring the data is uncorrelated with sensitive variables. The OB algorithm is model-agnostic, catering to a wide array of machine learning models and tasks, and includes a sparse variant to enhance numerical stability through regularization. Through empirical evaluation on simulated and real-world datasets - including the adult income and the COMPAS recidivism datasets - our methodology demonstrates its capacity to enable fairer outcomes without compromising accuracy.
☆ Climate Downscaling: A Deep-Learning Based Super-resolution Model of Precipitation Data with Attention Block and Skip Connections
Human activities accelerate consumption of fossil fuels and produce greenhouse gases, resulting in urgent issues today: global warming and the climate change. These indirectly cause severe natural disasters, plenty of lives suffering and huge losses of agricultural properties. To mitigate impacts on our lands, scientists are developing renewable, reusable, and clean energies and climatologists are trying to predict the extremes. Meanwhile, governments are publicizing resource-saving policies for a more eco-friendly society and arousing environment awareness. One of the most influencing factors is the precipitation, bringing condensed water vapor onto lands. Water resources are the most significant but basic needs in society, not only supporting our livings, but also economics. In Taiwan, although the average annual precipitation is up to 2,500 millimeter (mm), the water allocation for each person is lower than the global average due to drastically geographical elevation changes and uneven distribution through the year. Thus, it is crucial to track and predict the rainfall to make the most use of it and to prevent the floods. However, climate models have limited resolution and require intensive computational power for local-scale use. Therefore, we proposed a deep convolutional neural network with skip connections, attention blocks, and auxiliary data concatenation, in order to downscale the low-resolution precipitation data into high-resolution one. Eventually, we compare with other climate downscaling methods and show better performance in metrics of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Pearson Correlation, structural similarity index (SSIM), and forecast indicators.
☆ Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-storage environments. We provide code and trial video data at http://hovsg.github.io/.
comment: Code and video are available at http://hovsg.github.io/
☆ TractOracle: towards an anatomically-informed reward function for RL-based tractography
Reinforcement learning (RL)-based tractography is a competitive alternative to machine learning and classical tractography algorithms due to its high anatomical accuracy obtained without the need for any annotated data. However, the reward functions so far used to train RL agents do not encapsulate anatomical knowledge which causes agents to generate spurious false positives tracts. In this paper, we propose a new RL tractography system, TractOracle, which relies on a reward network trained for streamline classification. This network is used both as a reward function during training as well as a mean for stopping the tracking process early and thus reduce the number of false positive streamlines. This makes our system a unique method that evaluates and reconstructs WM streamlines at the same time. We report an improvement of true positive ratios by almost 20\% and a reduction of 3x of false positive ratios on one dataset and an increase between 2x and 7x in the number true positive streamlines on another dataset.
☆ Mechanistic Design and Scaling of Hybrid Architectures
The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and evaluation. We set out to simplify this process by grounding it in an end-to-end mechanistic architecture design (MAD) pipeline, encompassing small-scale capability unit tests predictive of scaling laws. Through a suite of synthetic token manipulation tasks such as compression and recall, designed to probe capabilities, we identify and test new hybrid architectures constructed from a variety of computational primitives. We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis, training over 500 language models between 70M to 7B parameters. Surprisingly, we find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures via isolated proxy tasks. The new architectures found via MAD, based on simple ideas such as hybridization and sparsity, outperform state-of-the-art Transformer, convolutional, and recurrent architectures (Transformer++, Hyena, Mamba) in scaling, both at compute-optimal budgets and in overtrained regimes. Overall, these results provide evidence that performance on curated synthetic tasks can be predictive of scaling laws, and that an optimal architecture should leverage specialized layers via a hybrid topology.
☆ GTA-HDR: A Large-Scale Synthetic Dataset for HDR Image Reconstruction
High Dynamic Range (HDR) content (i.e., images and videos) has a broad range of applications. However, capturing HDR content from real-world scenes is expensive and time- consuming. Therefore, the challenging task of reconstructing visually accurate HDR images from their Low Dynamic Range (LDR) counterparts is gaining attention in the vision research community. A major challenge in this research problem is the lack of datasets, which capture diverse scene conditions (e.g., lighting, shadows, weather, locations, landscapes, objects, humans, buildings) and various image features (e.g., color, contrast, saturation, hue, luminance, brightness, radiance). To address this gap, in this paper, we introduce GTA-HDR, a large-scale synthetic dataset of photo-realistic HDR images sampled from the GTA-V video game. We perform thorough evaluation of the proposed dataset, which demonstrates significant qualitative and quantitative improvements of the state-of-the-art HDR image reconstruction methods. Furthermore, we demonstrate the effectiveness of the proposed dataset and its impact on additional computer vision tasks including 3D human pose estimation, human body part segmentation, and holistic scene segmentation. The dataset, data collection pipeline, and evaluation code are available at: https://github.com/HrishavBakulBarua/GTA-HDR.
comment: Submitted to IEEE
☆ GPFL: A Gradient Projection-Based Client Selection Framework for Efficient Federated Learning
Federated learning client selection is crucial for determining participant clients while balancing model accuracy and communication efficiency. Existing methods have limitations in handling data heterogeneity, computational burdens, and independent client treatment. To address these challenges, we propose GPFL, which measures client value by comparing local and global descent directions. We also employ an Exploit-Explore mechanism to enhance performance. Experimental results on FEMINST and CIFAR-10 datasets demonstrate that GPFL outperforms baselines in Non-IID scenarios, achieving over 9\% improvement in FEMINST test accuracy. Moreover, GPFL exhibits shorter computation times through pre-selection and parameter reuse in federated learning.
comment: 8 pages, 5 figures
☆ Learning the Optimal Power Flow: Environment Design Matters
To solve the optimal power flow (OPF) problem, reinforcement learning (RL) emerges as a promising new approach. However, the RL-OPF literature is strongly divided regarding the exact formulation of the OPF problem as an RL environment. In this work, we collect and implement diverse environment design decisions from the literature regarding training data, observation space, episode definition, and reward function choice. In an experimental analysis, we show the significant impact of these environment design options on RL-OPF training performance. Further, we derive some first recommendations regarding the choice of these design decisions. The created environment framework is fully open-source and can serve as a benchmark for future research in the RL-OPF field.
☆ DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. We propose DiffH2O, a novel method to synthesize realistic, one or two-handed object interactions from provided text prompts and geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based interaction stage and use separate diffusion models for each. In the grasping stage, the model only generates hand motions, whereas in the interaction phase both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses. Third, we propose two different guidance schemes to allow more control of the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the interaction phase. For textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to have more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions. Moreover, we demonstrate the practicality of our framework by utilizing a hand pose estimate from an off-the-shelf pose estimator for guidance, and then sampling multiple different actions in the interaction stage.
comment: Project Page: https://diffh2o.github.io/
☆ Are Compressed Language Models Less Subgroup Robust? EMNLP 2023
To reduce the inference cost of large language models, model compression is increasingly used to create smaller scalable models. However, little is known about their robustness to minority subgroups defined by the labels and attributes of a dataset. In this paper, we investigate the effects of 18 different compression methods and settings on the subgroup robustness of BERT language models. We show that worst-group performance does not depend on model size alone, but also on the compression method used. Additionally, we find that model compression does not always worsen the performance on minority subgroups. Altogether, our analysis serves to further research into the subgroup robustness of model compression.
comment: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
☆ Annotated Biomedical Video Generation using Denoising Diffusion Probabilistic Models and Flow Fields
The segmentation and tracking of living cells play a vital role within the biomedical domain, particularly in cancer research, drug development, and developmental biology. These are usually tedious and time-consuming tasks that are traditionally done by biomedical experts. Recently, to automatize these processes, deep learning based segmentation and tracking methods have been proposed. These methods require large-scale datasets and their full potential is constrained by the scarcity of annotated data in the biomedical imaging domain. To address this limitation, we propose Biomedical Video Diffusion Model (BVDM), capable of generating realistic-looking synthetic microscopy videos. Trained only on a single real video, BVDM can generate videos of arbitrary length with pixel-level annotations that can be used for training data-hungry models. It is composed of a denoising diffusion probabilistic model (DDPM) generating high-fidelity synthetic cell microscopy images and a flow prediction model (FPM) predicting the non-rigid transformation between consecutive video frames. During inference, initially, the DDPM imposes realistic cell textures on synthetic cell masks which are generated based on real data statistics. The flow prediction model predicts the flow field between consecutive masks and applies that to the DDPM output from the previous time frame to create the next one while keeping temporal consistency. BVDM outperforms state-of-the-art synthetic live cell microscopy video generation models. Furthermore, we demonstrate that a sufficiently large synthetic dataset enhances the performance of cell segmentation and tracking models compared to using a limited amount of available real data.
☆ Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Many recent language model (LM) interpretability studies have adopted the circuits framework, which aims to find the minimal computational subgraph, or circuit, that explains LM behavior on a given task. Most studies determine which edges belong in a LM's circuit by performing causal interventions on each edge independently, but this scales poorly with model size. Edge attribution patching (EAP), gradient-based approximation to interventions, has emerged as a scalable but imperfect solution to this problem. In this paper, we introduce a new method - EAP with integrated gradients (EAP-IG) - that aims to better maintain a core property of circuits: faithfulness. A circuit is faithful if all model edges outside the circuit can be ablated without changing the model's performance on the task; faithfulness is what justifies studying circuits, rather than the full model. Our experiments demonstrate that circuits found using EAP are less faithful than those found using EAP-IG, even though both have high node overlap with circuits found previously using causal interventions. We conclude more generally that when using circuits to compare the mechanisms models use to solve tasks, faithfulness, not overlap, is what should be measured.
☆ Scenario-Based Curriculum Generation for Multi-Agent Autonomous Driving
The automated generation of diverse and complex training scenarios has been an important ingredient in many complex learning tasks. Especially in real-world application domains, such as autonomous driving, auto-curriculum generation is considered vital for obtaining robust and general policies. However, crafting traffic scenarios with multiple, heterogeneous agents is typically considered as a tedious and time-consuming task, especially in more complex simulation environments. In our work, we introduce MATS-Gym, a Multi-Agent Traffic Scenario framework to train agents in CARLA, a high-fidelity driving simulator. MATS-Gym is a multi-agent training framework for autonomous driving that uses partial scenario specifications to generate traffic scenarios with variable numbers of agents. This paper unifies various existing approaches to traffic scenario description into a single training framework and demonstrates how it can be integrated with techniques from unsupervised environment design to automate the generation of adaptive auto-curricula. The code is available at https://github.com/AutonomousDrivingExaminer/mats-gym.
comment: 7 Pages, Under Review
☆ Secure Aggregation is Not Private Against Membership Inference Attacks
Secure aggregation (SecAgg) is a commonly-used privacy-enhancing mechanism in federated learning, affording the server access only to the aggregate of model updates while safeguarding the confidentiality of individual updates. Despite widespread claims regarding SecAgg's privacy-preserving capabilities, a formal analysis of its privacy is lacking, making such presumptions unjustified. In this paper, we delve into the privacy implications of SecAgg by treating it as a local differential privacy (LDP) mechanism for each local update. We design a simple attack wherein an adversarial server seeks to discern which update vector a client submitted, out of two possible ones, in a single training round of federated learning under SecAgg. By conducting privacy auditing, we assess the success probability of this attack and quantify the LDP guarantees provided by SecAgg. Our numerical results unveil that, contrary to prevailing claims, SecAgg offers weak privacy against membership inference attacks even in a single training round. Indeed, it is difficult to hide a local update by adding other independent local updates when the updates are of high dimension. Our findings underscore the imperative for additional privacy-enhancing mechanisms, such as noise injection, in federated learning.
☆ SciNews: From Scholarly Complexities to Public Narratives -- A Dataset for Scientific News Report Generation LREC
Scientific news reports serve as a bridge, adeptly translating complex research articles into reports that resonate with the broader public. The automated generation of such narratives enhances the accessibility of scholarly insights. In this paper, we present a new corpus to facilitate this paradigm development. Our corpus comprises a parallel compilation of academic publications and their corresponding scientific news reports across nine disciplines. To demonstrate the utility and reliability of our dataset, we conduct an extensive analysis, highlighting the divergences in readability and brevity between scientific news narratives and academic manuscripts. We benchmark our dataset employing state-of-the-art text generation models. The evaluation process involves both automatic and human evaluation, which lays the groundwork for future explorations into the automated generation of scientific news reports. The dataset and code related to this work are available at https://dongqi.me/projects/SciNews.
comment: LREC-COLING 2024 Main Conference Paper
☆ Asymptotic Bayes risk of semi-supervised learning with uncertain labeling
This article considers a semi-supervised classification setting on a Gaussian mixture model, where the data is not labeled strictly as usual, but instead with uncertain labels. Our main aim is to compute the Bayes risk for this model. We compare the behavior of the Bayes risk and the best known algorithm for this model. This comparison eventually gives new insights over the algorithm.
☆ Noise2Noise Denoising of CRISM Hyperspectral Data ICLR 2024
Hyperspectral data acquired by the Compact Reconnaissance Imaging Spectrometer for Mars (CRISM) have allowed for unparalleled mapping of the surface mineralogy of Mars. Due to sensor degradation over time, a significant portion of the recently acquired data is considered unusable. Here a new data-driven model architecture, Noise2Noise4Mars (N2N4M), is introduced to remove noise from CRISM images. Our model is self-supervised and does not require zero-noise target data, making it well suited for use in Planetary Science applications where high quality labelled data is scarce. We demonstrate its strong performance on synthetic-noise data and CRISM images, and its impact on downstream classification performance, outperforming benchmark methods on most metrics. This allows for detailed analysis for critical sites of interest on the Martian surface, including proposed lander sites.
comment: 5 pages, 3 figures. Accepted as a conference paper at the ICLR 2024 ML4RS Workshop
☆ CCDSReFormer: Traffic Flow Prediction with a Criss-Crossed Dual-Stream Enhanced Rectified Transformer Model
Accurate, and effective traffic forecasting is vital for smart traffic systems, crucial in urban traffic planning and management. Current Spatio-Temporal Transformer models, despite their prediction capabilities, struggle with balancing computational efficiency and accuracy, favoring global over local information, and handling spatial and temporal data separately, limiting insight into complex interactions. We introduce the Criss-Crossed Dual-Stream Enhanced Rectified Transformer model (CCDSReFormer), which includes three innovative modules: Enhanced Rectified Spatial Self-attention (ReSSA), Enhanced Rectified Delay Aware Self-attention (ReDASA), and Enhanced Rectified Temporal Self-attention (ReTSA). These modules aim to lower computational needs via sparse attention, focus on local information for better traffic dynamics understanding, and merge spatial and temporal insights through a unique learning method. Extensive tests on six real-world datasets highlight CCDSReFormer's superior performance. An ablation study also confirms the significant impact of each component on the model's predictive accuracy, showcasing our model's ability to forecast traffic flow effectively.
comment: 18 pages
☆ Leave No Patient Behind: Enhancing Medication Recommendation for Rare Disease Patients
Medication recommendation systems have gained significant attention in healthcare as a means of providing tailored and effective drug combinations based on patients' clinical information. However, existing approaches often suffer from fairness issues, as recommendations tend to be more accurate for patients with common diseases compared to those with rare conditions. In this paper, we propose a novel model called Robust and Accurate REcommendations for Medication (RAREMed), which leverages the pretrain-finetune learning paradigm to enhance accuracy for rare diseases. RAREMed employs a transformer encoder with a unified input sequence approach to capture complex relationships among disease and procedure codes. Additionally, it introduces two self-supervised pre-training tasks, namely Sequence Matching Prediction (SMP) and Self Reconstruction (SR), to learn specialized medication needs and interrelations among clinical codes. Experimental results on two real-world datasets demonstrate that RAREMed provides accurate drug sets for both rare and common disease patients, thereby mitigating unfairness in medication recommendation systems.
☆ EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention SIGIR'24
To capture user preference, transformer models have been widely applied to model sequential user behavior data. The core of transformer architecture lies in the self-attention mechanism, which computes the pairwise attention scores in a sequence. Due to the permutation-equivariant nature, positional encoding is used to enhance the attention between token representations. In this setting, the pairwise attention scores can be derived by both semantic difference and positional difference. However, prior studies often model the two kinds of difference measurements in different ways, which potentially limits the expressive capacity of sequence modeling. To address this issue, this paper proposes a novel transformer variant with complex vector attention, named EulerFormer, which provides a unified theoretical framework to formulate both semantic difference and positional difference. The EulerFormer involves two key technical improvements. First, it employs a new transformation function for efficiently transforming the sequence tokens into polar-form complex vectors using Euler's formula, enabling the unified modeling of both semantic and positional information in a complex rotation form.Secondly, it develops a differential rotation mechanism, where the semantic rotation angles can be controlled by an adaptation function, enabling the adaptive integration of the semantic and positional information according to the semantic contexts.Furthermore, a phase contrastive learning task is proposed to improve the anisotropy of contextual representations in EulerFormer. Our theoretical framework possesses a high degree of completeness and generality. It is more robust to semantic variations and possesses moresuperior theoretical properties in principle. Extensive experiments conducted on four public datasets demonstrate the effectiveness and efficiency of our approach.
comment: Accepted for publication in SIGIR'24
☆ Masked Autoencoders are PDE Learners
Neural solvers for partial differential equations (PDEs) have great potential, yet their practicality is currently limited by their generalizability. PDEs evolve over broad scales and exhibit diverse behaviors; predicting these phenomena will require learning representations across a wide variety of inputs, which may encompass different coefficients, geometries, or equations. As a step towards generalizable PDE modeling, we adapt masked pretraining for PDEs. Through self-supervised learning across PDEs, masked autoencoders can learn useful latent representations for downstream tasks. In particular, masked pretraining can improve coefficient regression and timestepping performance of neural solvers on unseen equations. We hope that masked pretraining can emerge as a unifying method across large, unlabeled, and heterogeneous datasets to learn latent physics at scale.
comment: 10 pages, 3 figures
☆ Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation
Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its variants, have demonstrated notable performance in the field of vision. However, their feature extraction methods may not be sufficiently effective and retain some redundant structures, leaving room for parameter reduction. Motivated by previous spatial and channel attention methods, we propose Triplet Mamba-UNet. The method leverages residual VSS Blocks to extract intensive contextual features, while Triplet SSM is employed to fuse features across spatial and channel dimensions. We conducted experiments on ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir-SEG, CVC-ColonDB, and Kvasir-Instrument datasets, demonstrating the superior segmentation performance of our proposed TM-UNet. Additionally, compared to the previous VM-UNet, our model achieves a one-third reduction in parameters.
☆ MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation
When the predicted sequence length exceeds the length seen during training, the transformer's inference accuracy diminishes. Existing relative position encoding methods, such as those based on the ALiBi technique, address the length extrapolation challenge exclusively through the implementation of a single kernel function, which introduces a constant bias to every post-softmax attention scores according to their distance. These approaches do not investigate or employ multiple kernel functions to address the extrapolation challenge. Drawing on the ALiBi approach, this study proposes a novel relative positional encoding method, called MEP, which employs a weighted average to combine distinct kernel functions(such as the exponential kernel and the Gaussian kernel) to generate a bias that is applied to post-softmax attention scores. Initially, the framework utilizes various kernel functions to construct multiple kernel functions. Each kernel function adheres to a consistent mean weight coefficient, harnessing the synergistic advantages of different kernels to formulate an innovative bias function. Subsequently, specific slopes are tailored for each kernel function, applying penalties at varying rates, to enhance the model's extrapolation capabilities. Finally, this bias is seamlessly incorporated as a penalty to the post-softmax scores. We present two distinct versions of our method: a parameter-free variant that requires no new learnable parameters, which enhances length extrapolation capabilities without compromising training efficiency, and a parameterized variant capable of integrating state-of-the-art techniques. Empirical evaluations across diverse datasets have demonstrated that both variants of our method achieve state-of-the-art performance, outperforming traditional parameter-free and parameterized approaches.
☆ PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition
We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at https://github.com/ChenhongyiYang/PlainMamba
☆ Manifold-Guided Lyapunov Control with Diffusion Models
This paper presents a novel approach to generating stabilizing controllers for a large class of dynamical systems using diffusion models. The core objective is to develop stabilizing control functions by identifying the closest asymptotically stable vector field relative to a predetermined manifold and adjusting the control function based on this finding. To achieve this, we employ a diffusion model trained on pairs consisting of asymptotically stable vector fields and their corresponding Lyapunov functions. Our numerical results demonstrate that this pre-trained model can achieve stabilization over previously unseen systems efficiently and rapidly, showcasing the potential of our approach in fast zero-shot control and generalizability.
comment: 14 pages
☆ How Private is DP-SGD?
We demonstrate a substantial gap between the privacy guarantees of the Adaptive Batch Linear Queries (ABLQ) mechanism under different types of batch sampling: (i) Shuffling, and (ii) Poisson subsampling; the typical analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) follows by interpreting it as a post-processing of ABLQ. While shuffling based DP-SGD is more commonly used in practical implementations, it is neither analytically nor numerically amenable to easy privacy analysis. On the other hand, Poisson subsampling based DP-SGD is challenging to scalably implement, but has a well-understood privacy analysis, with multiple open-source numerically tight privacy accountants available. This has led to a common practice of using shuffling based DP-SGD in practice, but using the privacy analysis for the corresponding Poisson subsampling version. Our result shows that there can be a substantial gap between the privacy analysis when using the two types of batch sampling, and thus advises caution in reporting privacy parameters for DP-SGD.
☆ CANOS: A Fast and Scalable Neural AC-OPF Solver Robust To N-1 Perturbations
Optimal Power Flow (OPF) refers to a wide range of related optimization problems with the goal of operating power systems efficiently and securely. In the simplest setting, OPF determines how much power to generate in order to minimize costs while meeting demand for power and satisfying physical and operational constraints. In even the simplest case, power grid operators use approximations of the AC-OPF problem because solving the exact problem is prohibitively slow with state-of-the-art solvers. These approximations sacrifice accuracy and operational feasibility in favor of speed. This trade-off leads to costly "uplift payments" and increased carbon emissions, especially for large power grids. In the present work, we train a deep learning system (CANOS) to predict near-optimal solutions (within 1% of the true AC-OPF cost) without compromising speed (running in as little as 33--65 ms). Importantly, CANOS scales to realistic grid sizes with promising empirical results on grids containing as many as 10,000 buses. Finally, because CANOS is a Graph Neural Network, it is robust to changes in topology. We show that CANOS is accurate across N-1 topological perturbations of a base grid typically used in security-constrained analysis. This paves the way for more efficient optimization of more complex OPF problems which alter grid connectivity such as unit commitment, topology optimization and security-constrained OPF.
☆ SGHormer: An Energy-Saving Graph Transformer Driven by Spikes
Graph Transformers (GTs) with powerful representation learning ability make a huge success in wide range of graph tasks. However, the costs behind outstanding performances of GTs are higher energy consumption and computational overhead. The complex structure and quadratic complexity during attention calculation in vanilla transformer seriously hinder its scalability on the large-scale graph data. Though existing methods have made strides in simplifying combinations among blocks or attention-learning paradigm to improve GTs' efficiency, a series of energy-saving solutions originated from biologically plausible structures are rarely taken into consideration when constructing GT framework. To this end, we propose a new spiking-based graph transformer (SGHormer). It turns full-precision embeddings into sparse and binarized spikes to reduce memory and computational costs. The spiking graph self-attention and spiking rectify blocks in SGHormer explicitly capture global structure information and recover the expressive power of spiking embeddings, respectively. In experiments, SGHormer achieves comparable performances to other full-precision GTs with extremely low computational energy consumption. The results show that SGHomer makes a remarkable progress in the field of low-energy GTs.
comment: 9 pages, 3 figures
☆ Uncertainty-aware Distributional Offline Reinforcement Learning
Offline reinforcement learning (RL) presents distinct challenges as it relies solely on observational data. A central concern in this context is ensuring the safety of the learned policy by quantifying uncertainties associated with various actions and environmental stochasticity. Traditional approaches primarily emphasize mitigating epistemic uncertainty by learning risk-averse policies, often overlooking environmental stochasticity. In this study, we propose an uncertainty-aware distributional offline RL method to simultaneously address both epistemic uncertainty and environmental stochasticity. We propose a model-free offline RL algorithm capable of learning risk-averse policies and characterizing the entire distribution of discounted cumulative rewards, as opposed to merely maximizing the expected value of accumulated discounted returns. Our method is rigorously evaluated through comprehensive experiments in both risk-sensitive and risk-neutral benchmarks, demonstrating its superior performance.
☆ PeersimGym: An Environment for Solving the Task Offloading Problem with Reinforcement Learning
Task offloading, crucial for balancing computational loads across devices in networks such as the Internet of Things, poses significant optimization challenges, including minimizing latency and energy usage under strict communication and storage constraints. While traditional optimization falls short in scalability; and heuristic approaches lack in achieving optimal outcomes, Reinforcement Learning (RL) offers a promising avenue by enabling the learning of optimal offloading strategies through iterative interactions. However, the efficacy of RL hinges on access to rich datasets and custom-tailored, realistic training environments. To address this, we introduce PeersimGym, an open-source, customizable simulation environment tailored for developing and optimizing task offloading strategies within computational networks. PeersimGym supports a wide range of network topologies and computational constraints and integrates a \textit{PettingZoo}-based interface for RL agent deployment in both solo and multi-agent setups. Furthermore, we demonstrate the utility of the environment through experiments with Deep Reinforcement Learning agents, showcasing the potential of RL-based approaches to significantly enhance offloading strategies in distributed computing settings. PeersimGym thus bridges the gap between theoretical RL models and their practical applications, paving the way for advancements in efficient task offloading methodologies.
☆ Retentive Decision Transformer with Adaptive Masking for Reinforcement Learning based Recommendation Systems
Reinforcement Learning-based Recommender Systems (RLRS) have shown promise across a spectrum of applications, from e-commerce platforms to streaming services. Yet, they grapple with challenges, notably in crafting reward functions and harnessing large pre-existing datasets within the RL framework. Recent advancements in offline RLRS provide a solution for how to address these two challenges. However, existing methods mainly rely on the transformer architecture, which, as sequence lengths increase, can introduce challenges associated with computational resources and training costs. Additionally, the prevalent methods employ fixed-length input trajectories, restricting their capacity to capture evolving user preferences. In this study, we introduce a new offline RLRS method to deal with the above problems. We reinterpret the RLRS challenge by modeling sequential decision-making as an inference task, leveraging adaptive masking configurations. This adaptive approach selectively masks input tokens, transforming the recommendation task into an inference challenge based on varying token subsets, thereby enhancing the agent's ability to infer across diverse trajectory lengths. Furthermore, we incorporate a multi-scale segmented retention mechanism that facilitates efficient modeling of long sequences, significantly enhancing computational efficiency. Our experimental analysis, conducted on both online simulator and offline datasets, clearly demonstrates the advantages of our proposed method.
☆ Data-driven Energy Consumption Modelling for Electric Micromobility using an Open Dataset
The escalating challenges of traffic congestion and environmental degradation underscore the critical importance of embracing E-Mobility solutions in urban spaces. In particular, micro E-Mobility tools such as E-scooters and E-bikes, play a pivotal role in this transition, offering sustainable alternatives for urban commuters. However, the energy consumption patterns for these tools are a critical aspect that impacts their effectiveness in real-world scenarios and is essential for trip planning and boosting user confidence in using these. To this effect, recent studies have utilised physical models customised for specific mobility tools and conditions, but these models struggle with generalization and effectiveness in real-world scenarios due to a notable absence of open datasets for thorough model evaluation and verification. To fill this gap, our work presents an open dataset, collected in Dublin, Ireland, specifically designed for energy modelling research related to E-Scooters and E-Bikes. Furthermore, we provide a comprehensive analysis of energy consumption modelling based on the dataset using a set of representative machine learning algorithms and compare their performance against the contemporary mathematical models as a baseline. Our results demonstrate a notable advantage for data-driven models in comparison to the corresponding mathematical models for estimating energy consumption. Specifically, data-driven models outperform physical models in accuracy by up to 83.83% for E-Bikes and 82.16% for E-Scooters based on an in-depth analysis of the dataset under certain assumptions.
comment: 7 pages, 5 figures, 4 tables. This manuscript has been accepted by the IEEE ITEC 2024
☆ Fake or JPEG? Revealing Common Biases in Generated Image Detection Datasets
The widespread adoption of generative image models has highlighted the urgent need to detect artificial content, which is a crucial step in combating widespread manipulation and misinformation. Consequently, numerous detectors and associated datasets have emerged. However, many of these datasets inadvertently introduce undesirable biases, thereby impacting the effectiveness and evaluation of detectors. In this paper, we emphasize that many datasets for AI-generated image detection contain biases related to JPEG compression and image size. Using the GenImage dataset, we demonstrate that detectors indeed learn from these undesired factors. Furthermore, we show that removing the named biases substantially increases robustness to JPEG compression and significantly alters the cross-generator performance of evaluated detectors. Specifically, it leads to more than 11 percentage points increase in cross-generator performance for ResNet50 and Swin-T detectors on the GenImage dataset, achieving state-of-the-art results. We provide the dataset and source codes of this paper on the anonymous website: https://www.unbiased-genimage.org
☆ LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation
Microscopic traffic simulation plays a crucial role in transportation engineering by providing insights into individual vehicle behavior and overall traffic flow. However, creating a realistic simulator that accurately replicates human driving behaviors in various traffic conditions presents significant challenges. Traditional simulators relying on heuristic models often fail to deliver accurate simulations due to the complexity of real-world traffic environments. Due to the covariate shift issue, existing imitation learning-based simulators often fail to generate stable long-term simulations. In this paper, we propose a novel approach called learner-aware supervised imitation learning to address the covariate shift problem in multi-agent imitation learning. By leveraging a variational autoencoder simultaneously modeling the expert and learner state distribution, our approach augments expert states such that the augmented state is aware of learner state distribution. Our method, applied to urban traffic simulation, demonstrates significant improvements over existing state-of-the-art baselines in both short-term microscopic and long-term macroscopic realism when evaluated on the real-world dataset pNEUMA.
comment: accepted by cvpr 2024. arXiv admin note: text overlap with arXiv:2306.06401
☆ On the Benefits of Over-parameterization for Out-of-Distribution Generalization
In recent years, machine learning models have achieved success based on the independently and identically distributed assumption. However, this assumption can be easily violated in real-world applications, leading to the Out-of-Distribution (OOD) problem. Understanding how modern over-parameterized DNNs behave under non-trivial natural distributional shifts is essential, as current theoretical understanding is insufficient. Existing theoretical works often provide meaningless results for over-parameterized models in OOD scenarios or even contradict empirical findings. To this end, we are investigating the performance of the over-parameterized model in terms of OOD generalization under the general benign overfitting conditions. Our analysis focuses on a random feature model and examines non-trivial natural distributional shifts, where the benign overfitting estimators demonstrate a constant excess OOD loss, despite achieving zero excess in-distribution (ID) loss. We demonstrate that in this scenario, further increasing the model's parameterization can significantly reduce the OOD loss. Intuitively, the variance term of ID loss remains low due to orthogonality of long-tail features, meaning overfitting noise during training generally doesn't raise testing loss. However, in OOD cases, distributional shift increases the variance term. Thankfully, the inherent shift is unrelated to individual x, maintaining the orthogonality of long-tail features. Expanding the hidden dimension can additionally improve this orthogonality by mapping the features into higher-dimensional spaces, thereby reducing the variance term. We further show that model ensembles also improve OOD loss, akin to increasing model capacity. These insights explain the empirical phenomenon of enhanced OOD generalization through model ensembles, supported by consistent simulations with theoretical results.
☆ Forest-ORE: Mining Optimal Rule Ensemble to interpret Random Forest models
Random Forest (RF) is well-known as an efficient ensemble learning method in terms of predictive performance. It is also considered a Black Box because of its hundreds of deep decision trees. This lack of interpretability can be a real drawback for acceptance of RF models in several real-world applications, especially those affecting one's lives, such as in healthcare, security, and law. In this work, we present Forest-ORE, a method that makes RF interpretable via an optimized rule ensemble (ORE) for local and global interpretation. Unlike other rule-based approaches aiming at interpreting the RF model, this method simultaneously considers several parameters that influence the choice of an interpretable rule ensemble. Existing methods often prioritize predictive performance over interpretability coverage and do not provide information about existing overlaps or interactions between rules. Forest-ORE uses a mixed-integer optimization program to build an ORE that considers the trade-off between predictive performance, interpretability coverage, and model size (size of the rule ensemble, rule lengths, and rule overlaps). In addition to providing an ORE competitive in predictive performance with RF, this method enriches the ORE through other rules that afford complementary information. It also enables monitoring of the rule selection process and delivers various metrics that can be used to generate a graphical representation of the final model. This framework is illustrated through an example, and its robustness is assessed through 36 benchmark datasets. A comparative analysis of well-known methods shows that Forest-ORE provides an excellent trade-off between predictive performance, interpretability coverage, and model size.
comment: 48 pages, 11 figures
☆ Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models CVPR2024
With the emergence of pre-trained vision-language models like CLIP, how to adapt them to various downstream classification tasks has garnered significant attention in recent research. The adaptation strategies can be typically categorized into three paradigms: zero-shot adaptation, few-shot adaptation, and the recently-proposed training-free few-shot adaptation. Most existing approaches are tailored for a specific setting and can only cater to one or two of these paradigms. In this paper, we introduce a versatile adaptation approach that can effectively work under all three settings. Specifically, we propose the dual memory networks that comprise dynamic and static memory components. The static memory caches training data knowledge, enabling training-free few-shot adaptation, while the dynamic memory preserves historical test features online during the testing process, allowing for the exploration of additional data insights beyond the training set. This novel capability enhances model performance in the few-shot setting and enables model usability in the absence of training data. The two memory networks employ the same flexible memory interactive strategy, which can operate in a training-free mode and can be further enhanced by incorporating learnable projection layers. Our approach is tested across 11 datasets under the three task settings. Remarkably, in the zero-shot scenario, it outperforms existing methods by over 3\% and even shows superior results against methods utilizing external training data. Additionally, our method exhibits robust performance against natural distribution shifts. Codes are available at \url{https://github.com/YBZh/DMN}.
comment: CVPR2024; Codes are available at \url{https://github.com/YBZh/DMN}
☆ Towards a Zero-Data, Controllable, Adaptive Dialog System
Conversational Tree Search (V\"ath et al., 2023) is a recent approach to controllable dialog systems, where domain experts shape the behavior of a Reinforcement Learning agent through a dialog tree. The agent learns to efficiently navigate this tree, while adapting to information needs, e.g., domain familiarity, of different users. However, the need for additional training data hinders deployment in new domains. To address this, we explore approaches to generate this data directly from dialog trees. We improve the original approach, and show that agents trained on synthetic data can achieve comparable dialog success to models trained on human data, both when using a commercial Large Language Model for generation, or when using a smaller open-source model, running on a single GPU. We further demonstrate the scalability of our approach by collecting and testing on two new datasets: ONBOARD, a new domain helping foreign residents moving to a new city, and the medical domain DIAGNOSE, a subset of Wikipedia articles related to scalp and head symptoms. Finally, we perform human testing, where no statistically significant differences were found in either objective or subjective measures between models trained on human and generated data.
☆ Enhancing Privacy in Federated Learning through Local Training
In this paper we propose the federated private local training algorithm (Fed-PLT) for federated learning, to overcome the challenges of (i) expensive communications and (ii) privacy preservation. We address (i) by allowing for both partial participation and local training, which significantly reduce the number of communication rounds between the central coordinator and computing agents. The algorithm matches the state of the art in the sense that the use of local training demonstrably does not impact accuracy. Additionally, agents have the flexibility to choose from various local training solvers, such as (stochastic) gradient descent and accelerated gradient descent. Further, we investigate how employing local training can enhance privacy, addressing point (ii). In particular, we derive differential privacy bounds and highlight their dependence on the number of local training epochs. We assess the effectiveness of the proposed algorithm by comparing it to alternative techniques, considering both theoretical analysis and numerical results from a classification task.
☆ A Survey on Deep Learning and State-of-the-arts Applications
Deep learning, a branch of artificial intelligence, is a computational model that uses multiple layers of interconnected units (neurons) to learn intricate patterns and representations directly from raw input data. Empowered by this learning capability, it has become a powerful tool for solving complex problems and is the core driver of many groundbreaking technologies and innovations. Building a deep learning model is a challenging task due to the algorithm`s complexity and the dynamic nature of real-world problems. Several studies have reviewed deep learning concepts and applications. However, the studies mostly focused on the types of deep learning models and convolutional neural network architectures, offering limited coverage of the state-of-the-art of deep learning models and their applications in solving complex problems across different domains. Therefore, motivated by the limitations, this study aims to comprehensively review the state-of-the-art deep learning models in computer vision, natural language processing, time series analysis and pervasive computing. We highlight the key features of the models and their effectiveness in solving the problems within each domain. Furthermore, this study presents the fundamentals of deep learning, various deep learning model types and prominent convolutional neural network architectures. Finally, challenges and future directions in deep learning research are discussed to offer a broader perspective for future researchers.
☆ DeepMIF: Deep Monotonic Implicit Fields for Large-Scale LiDAR 3D Mapping
Recently, significant progress has been achieved in sensing real large-scale outdoor 3D environments, particularly by using modern acquisition equipment such as LiDAR sensors. Unfortunately, they are fundamentally limited in their ability to produce dense, complete 3D scenes. To address this issue, recent learning-based methods integrate neural implicit representations and optimizable feature grids to approximate surfaces of 3D scenes. However, naively fitting samples along raw LiDAR rays leads to noisy 3D mapping results due to the nature of sparse, conflicting LiDAR measurements. Instead, in this work we depart from fitting LiDAR data exactly, instead letting the network optimize a non-metric monotonic implicit field defined in 3D space. To fit our field, we design a learning system integrating a monotonicity loss that enables optimizing neural monotonic fields and leverages recent progress in large-scale 3D mapping. Our algorithm achieves high-quality dense 3D mapping performance as captured by multiple quantitative and perceptual measures and visual results obtained for Mai City, Newer College, and KITTI benchmarks. The code of our approach will be made publicly available.
comment: 8 pages, 6 figures
☆ VDSC: Enhancing Exploration Timing with Value Discrepancy and State Counts
Despite the considerable attention given to the questions of \textit{how much} and \textit{how to} explore in deep reinforcement learning, the investigation into \textit{when} to explore remains relatively less researched. While more sophisticated exploration strategies can excel in specific, often sparse reward environments, existing simpler approaches, such as $\epsilon$-greedy, persist in outperforming them across a broader spectrum of domains. The appeal of these simpler strategies lies in their ease of implementation and generality across a wide range of domains. The downside is that these methods are essentially a blind switching mechanism, which completely disregards the agent's internal state. In this paper, we propose to leverage the agent's internal state to decide \textit{when} to explore, addressing the shortcomings of blind switching mechanisms. We present Value Discrepancy and State Counts through homeostasis (VDSC), a novel approach for efficient exploration timing. Experimental results on the Atari suite demonstrate the superiority of our strategy over traditional methods such as $\epsilon$-greedy and Boltzmann, as well as more sophisticated techniques like Noisy Nets.
☆ BVR Gym: A Reinforcement Learning Environment for Beyond-Visual-Range Air Combat
Creating new air combat tactics and discovering novel maneuvers can require numerous hours of expert pilots' time. Additionally, for each different combat scenario, the same strategies may not work since small changes in equipment performance may drastically change the air combat outcome. For this reason, we created a reinforcement learning environment to help investigate potential air combat tactics in the field of beyond-visual-range (BVR) air combat: the BVR Gym. This type of air combat is important since long-range missiles are often the first weapon to be used in aerial combat. Some existing environments provide high-fidelity simulations but are either not open source or are not adapted to the BVR air combat domain. Other environments are open source but use less accurate simulation models. Our work provides a high-fidelity environment based on the open-source flight dynamics simulator JSBSim and is adapted to the BVR air combat domain. This article describes the building blocks of the environment and some use cases.
comment: 8 pages, 8 figures
☆ Boosting Adversarial Training via Fisher-Rao Norm-based Regularization CVPR2024
Adversarial training is extensively utilized to improve the adversarial robustness of deep neural networks. Yet, mitigating the degradation of standard generalization performance in adversarial-trained models remains an open problem. This paper attempts to resolve this issue through the lens of model complexity. First, We leverage the Fisher-Rao norm, a geometrically invariant metric for model complexity, to establish the non-trivial bounds of the Cross-Entropy Loss-based Rademacher complexity for a ReLU-activated Multi-Layer Perceptron. Then we generalize a complexity-related variable, which is sensitive to the changes in model width and the trade-off factors in adversarial training. Moreover, intensive empirical evidence validates that this variable highly correlates with the generalization gap of Cross-Entropy loss between adversarial-trained and standard-trained models, especially during the initial and final phases of the training process. Building upon this observation, we propose a novel regularization framework, called Logit-Oriented Adversarial Training (LOAT), which can mitigate the trade-off between robustness and accuracy while imposing only a negligible increase in computational overhead. Our extensive experiments demonstrate that the proposed regularization strategy can boost the performance of the prevalent adversarial training algorithms, including PGD-AT, TRADES, TRADES (LSE), MART, and DM-AT, across various network architectures. Our code will be available at https://github.com/TrustAI/LOAT.
comment: This paper has been accepted to CVPR2024
☆ Prediction-sharing During Training and Inference
Two firms are engaged in a competitive prediction task. Each firm has two sources of data -- labeled historical data and unlabeled inference-time data -- and uses the former to derive a prediction model, and the latter to make predictions on new instances. We study data-sharing contracts between the firms. The novelty of our study is to introduce and highlight the differences between contracts that share prediction models only, contracts to share inference-time predictions only, and contracts to share both. Our analysis proceeds on three levels. First, we develop a general Bayesian framework that facilitates our study. Second, we narrow our focus to two natural settings within this framework: (i) a setting in which the accuracy of each firm's prediction model is common knowledge, but the correlation between the respective models is unknown; and (ii) a setting in which two hypotheses exist regarding the optimal predictor, and one of the firms has a structural advantage in deducing it. Within these two settings we study optimal contract choice. More specifically, we find the individually rational and Pareto-optimal contracts for some notable cases, and describe specific settings where each of the different sharing contracts emerge as optimal. Finally, in the third level of our analysis we demonstrate the applicability of our concepts in a synthetic simulation using real loan data.
☆ EL-MLFFs: Ensemble Learning of Machine Leaning Force Fields
Machine learning force fields (MLFFs) have emerged as a promising approach to bridge the accuracy of quantum mechanical methods and the efficiency of classical force fields. However, the abundance of MLFF models and the challenge of accurately predicting atomic forces pose significant obstacles in their practical application. In this paper, we propose a novel ensemble learning framework, EL-MLFFs, which leverages the stacking method to integrate predictions from diverse MLFFs and enhance force prediction accuracy. By constructing a graph representation of molecular structures and employing a graph neural network (GNN) as the meta-model, EL-MLFFs effectively captures atomic interactions and refines force predictions. We evaluate our approach on two distinct datasets: methane molecules and methanol adsorbed on a Cu(100) surface. The results demonstrate that EL-MLFFs significantly improves force prediction accuracy compared to individual MLFFs, with the ensemble of all eight models yielding the best performance. Moreover, our ablation study highlights the crucial roles of the residual network and graph attention layers in the model's architecture. The EL-MLFFs framework offers a promising solution to the challenges of model selection and force prediction accuracy in MLFFs, paving the way for more reliable and efficient molecular simulations.
comment: 12 pages, 3 figures
☆ DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning AAAI 2024
Class-incremental learning (CIL) under an exemplar-free constraint has presented a significant challenge. Existing methods adhering to this constraint are prone to catastrophic forgetting, far more so than replay-based techniques that retain access to past samples. In this paper, to solve the exemplar-free CIL problem, we propose a Dual-Stream Analytic Learning (DS-AL) approach. The DS-AL contains a main stream offering an analytical (i.e., closed-form) linear solution, and a compensation stream improving the inherent under-fitting limitation due to adopting linear mapping. The main stream redefines the CIL problem into a Concatenated Recursive Least Squares (C-RLS) task, allowing an equivalence between the CIL and its joint-learning counterpart. The compensation stream is governed by a Dual-Activation Compensation (DAC) module. This module re-activates the embedding with a different activation function from the main stream one, and seeks fitting compensation by projecting the embedding to the null space of the main stream's linear mapping. Empirical results demonstrate that the DS-AL, despite being an exemplar-free technique, delivers performance comparable with or better than that of replay-based methods across various datasets, including CIFAR-100, ImageNet-100 and ImageNet-Full. Additionally, the C-RLS' equivalent property allows the DS-AL to execute CIL in a phase-invariant manner. This is evidenced by a never-before-seen 500-phase CIL ImageNet task, which performs on a level identical to a 5-phase one. Our codes are available at https://github.com/ZHUANGHP/Analytic-continual-learning.
comment: Accepted in AAAI 2024
☆ Variational Graph Auto-Encoder Based Inductive Learning Method for Semi-Supervised Classification
Graph representation learning is a fundamental research issue in various domains of applications, of which the inductive learning problem is particularly challenging as it requires models to generalize to unseen graph structures during inference. In recent years, graph neural networks (GNNs) have emerged as powerful graph models for inductive learning tasks such as node classification, whereas they typically heavily rely on the annotated nodes under a fully supervised training setting. Compared with the GNN-based methods, variational graph auto-encoders (VGAEs) are known to be more generalizable to capture the internal structural information of graphs independent of node labels and have achieved prominent performance on multiple unsupervised learning tasks. However, so far there is still a lack of work focusing on leveraging the VGAE framework for inductive learning, due to the difficulties in training the model in a supervised manner and avoiding over-fitting the proximity information of graphs. To solve these problems and improve the model performance of VGAEs for inductive graph representation learning, in this work, we propose the Self-Label Augmented VGAE model. To leverage the label information for training, our model takes node labels as one-hot encoded inputs and then performs label reconstruction in model training. To overcome the scarcity problem of node labels for semi-supervised settings, we further propose the Self-Label Augmentation Method (SLAM), which uses pseudo labels generated by our model with a node-wise masking approach to enhance the label information. Experiments on benchmark inductive learning graph datasets verify that our proposed model archives promising results on node classification with particular superiority under semi-supervised learning settings.
☆ Capacity Provisioning Motivated Online Non-Convex Optimization Problem with Memory and Switching Cost
An online non-convex optimization problem is considered where the goal is to minimize the flow time (total delay) of a set of jobs by modulating the number of active servers, but with a switching cost associated with changing the number of active servers over time. Each job can be processed by at most one fixed speed server at any time. Compared to the usual online convex optimization (OCO) problem with switching cost, the objective function considered is non-convex and more importantly, at each time, it depends on all past decisions and not just the present one. Both worst-case and stochastic inputs are considered; for both cases, competitive algorithms are derived.
☆ Natural Language Requirements Testability Measurement Based on Requirement Smells
Requirements form the basis for defining software systems' obligations and tasks. Testable requirements help prevent failures, reduce maintenance costs, and make it easier to perform acceptance tests. However, despite the importance of measuring and quantifying requirements testability, no automatic approach for measuring requirements testability has been proposed based on the requirements smells, which are at odds with the requirements testability. This paper presents a mathematical model to evaluate and rank the natural language requirements testability based on an extensive set of nine requirements smells, detected automatically, and acceptance test efforts determined by requirement length and its application domain. Most of the smells stem from uncountable adjectives, context-sensitive, and ambiguous words. A comprehensive dictionary is required to detect such words. We offer a neural word-embedding technique to generate such a dictionary automatically. Using the dictionary, we could automatically detect Polysemy smell (domain-specific ambiguity) for the first time in 10 application domains. Our empirical study on nearly 1000 software requirements from six well-known industrial and academic projects demonstrates that the proposed smell detection approach outperforms Smella, a state-of-the-art tool, in detecting requirements smells. The precision and recall of smell detection are improved with an average of 0.03 and 0.33, respectively, compared to the state-of-the-art. The proposed requirement testability model measures the testability of 985 requirements with a mean absolute error of 0.12 and a mean squared error of 0.03, demonstrating the model's potential for practical use.
comment: 45 pages, 16 figures, and 13 tables; submitted as a journal paper
☆ A Unified Kernel for Neural Network Learning
Past decades have witnessed a great interest in the distinction and connection between neural network learning and kernel learning. Recent advancements have made theoretical progress in connecting infinite-wide neural networks and Gaussian processes. Two predominant approaches have emerged: the Neural Network Gaussian Process (NNGP) and the Neural Tangent Kernel (NTK). The former, rooted in Bayesian inference, represents a zero-order kernel, while the latter, grounded in the tangent space of gradient descents, is a first-order kernel. In this paper, we present the Unified Neural Kernel (UNK), which characterizes the learning dynamics of neural networks with gradient descents and parameter initialization. The proposed UNK kernel maintains the limiting properties of both NNGP and NTK, exhibiting behaviors akin to NTK with a finite learning step and converging to NNGP as the learning step approaches infinity. Besides, we also theoretically characterize the uniform tightness and learning convergence of the UNK kernel, providing comprehensive insights into this unified kernel. Experimental results underscore the effectiveness of our proposed method.
☆ Expectations Versus Reality: Evaluating Intrusion Detection Systems in Practice
Our paper provides empirical comparisons between recent IDSs to provide an objective comparison between them to help users choose the most appropriate solution based on their requirements. Our results show that no one solution is the best, but is dependent on external variables such as the types of attacks, complexity, and network environment in the dataset. For example, BoT_IoT and Stratosphere IoT datasets both capture IoT-related attacks, but the deep neural network performed the best when tested using the BoT_IoT dataset while HELAD performed the best when tested using the Stratosphere IoT dataset. So although we found that a deep neural network solution had the highest average F1 scores on tested datasets, it is not always the best-performing one. We further discuss difficulties in using IDS from literature and project repositories, which complicated drawing definitive conclusions regarding IDS selection.
comment: 10 pages
☆ Imitating Cost-Constrained Behaviors in Reinforcement Learning ICAPS-24
Complex planning and scheduling problems have long been solved using various optimization or heuristic approaches. In recent years, imitation learning that aims to learn from expert demonstrations has been proposed as a viable alternative to solving these problems. Generally speaking, imitation learning is designed to learn either the reward (or preference) model or directly the behavioral policy by observing the behavior of an expert. Existing work in imitation learning and inverse reinforcement learning has focused on imitation primarily in unconstrained settings (e.g., no limit on fuel consumed by the vehicle). However, in many real-world domains, the behavior of an expert is governed not only by reward (or preference) but also by constraints. For instance, decisions on self-driving delivery vehicles are dependent not only on the route preferences/rewards (depending on past demand data) but also on the fuel in the vehicle and the time available. In such problems, imitation learning is challenging as decisions are not only dictated by the reward model but are also dependent on a cost-constrained model. In this paper, we provide multiple methods that match expert distributions in the presence of trajectory cost constraints through (a) Lagrangian-based method; (b) Meta-gradients to find a good trade-off between expected return and minimizing constraint violation; and (c) Cost-violation-based alternating gradient. We empirically show that leading imitation learning approaches imitate cost-constrained behaviors poorly and our meta-gradient-based approach achieves the best performance.
comment: Accepted to the 34th International Conference on Automated Planning and Scheduling (ICAPS-24)
☆ Chain of Compression: A Systematic Approach to Combinationally Compress Convolutional Neural Networks
Convolutional neural networks (CNNs) have achieved significant popularity, but their computational and memory intensity poses challenges for resource-constrained computing systems, particularly with the prerequisite of real-time performance. To release this burden, model compression has become an important research focus. Many approaches like quantization, pruning, early exit, and knowledge distillation have demonstrated the effect of reducing redundancy in neural networks. Upon closer examination, it becomes apparent that each approach capitalizes on its unique features to compress the neural network, and they can also exhibit complementary behavior when combined. To explore the interactions and reap the benefits from the complementary features, we propose the Chain of Compression, which works on the combinational sequence to apply these common techniques to compress the neural network. Validated on the image-based regression and classification networks across different data sets, our proposed Chain of Compression can significantly compress the computation cost by 100-1000 times with ignorable accuracy loss compared with the baseline model.
comment: 10 pages, 15 figures
☆ Incorporating Exponential Smoothing into MLP: A Simple but Effective Sequence Model
Modeling long-range dependencies in sequential data is a crucial step in sequence learning. A recently developed model, the Structured State Space (S4), demonstrated significant effectiveness in modeling long-range sequences. However, It is unclear whether the success of S4 can be attributed to its intricate parameterization and HiPPO initialization or simply due to State Space Models (SSMs). To further investigate the potential of the deep SSMs, we start with exponential smoothing (ETS), a simple SSM, and propose a stacked architecture by directly incorporating it into an element-wise MLP. We augment simple ETS with additional parameters and complex field to reduce the inductive bias. Despite increasing less than 1\% of parameters of element-wise MLP, our models achieve comparable results to S4 on the LRA benchmark.
comment: 12 pages, 5 tables, 3 figures
☆ Particle identification with machine learning from incomplete data in the ALICE experiment
The ALICE experiment at the LHC measures properties of the strongly interacting matter formed in ultrarelativistic heavy-ion collisions. Such studies require accurate particle identification (PID). ALICE provides PID information via several detectors for particles with momentum from about 100 MeV/c up to 20 GeV/c. Traditionally, particles are selected with rectangular cuts. Acmuch better performance can be achieved with machine learning (ML) methods. Our solution uses multiple neural networks (NN) serving as binary classifiers. Moreover, we extended our particle classifier with Feature Set Embedding and attention in order to train on data with incomplete samples. We also present the integration of the ML project with the ALICE analysis software, and we discuss domain adaptation, the ML technique needed to transfer the knowledge between simulated and real experimental data.
comment: Proceedings of 3rd Artificial Intelligence for the Electron Ion Collider workshop -- AI4EIC2023, 28.11-1.12.2023. Prepared for submission to JINST
☆ Robust and Scalable Model Editing for Large Language Models LREC
Large language models (LLMs) can make predictions using parametric knowledge--knowledge encoded in the model weights--or contextual knowledge--knowledge presented in the context. In many scenarios, a desirable behavior is that LLMs give precedence to contextual knowledge when it conflicts with the parametric knowledge, and fall back to using their parametric knowledge when the context is irrelevant. This enables updating and correcting the model's knowledge by in-context editing instead of retraining. Previous works have shown that LLMs are inclined to ignore contextual knowledge and fail to reliably fall back to parametric knowledge when presented with irrelevant context. In this work, we discover that, with proper prompting methods, instruction-finetuned LLMs can be highly controllable by contextual knowledge and robust to irrelevant context. Utilizing this feature, we propose EREN (Edit models by REading Notes) to improve the scalability and robustness of LLM editing. To better evaluate the robustness of model editors, we collect a new dataset, that contains irrelevant questions that are more challenging than the ones in existing datasets. Empirical results show that our method outperforms current state-of-the-art methods by a large margin. Unlike existing techniques, it can integrate knowledge from multiple edits, and correctly respond to syntactically similar but semantically unrelated inputs (and vice versa). The source code can be found at https://github.com/thunlp/EREN.
comment: LREC-COLING 2024 paper, 16 pages, 4 figures
☆ Masked Multi-Domain Network: Multi-Type and Multi-Scenario Conversion Rate Prediction with a Single Model CIKM 2023
In real-world advertising systems, conversions have different types in nature and ads can be shown in different display scenarios, both of which highly impact the actual conversion rate (CVR). This results in the multi-type and multi-scenario CVR prediction problem. A desired model for this problem should satisfy the following requirements: 1) Accuracy: the model should achieve fine-grained accuracy with respect to any conversion type in any display scenario. 2) Scalability: the model parameter size should be affordable. 3) Convenience: the model should not require a large amount of effort in data partitioning, subset processing and separate storage. Existing approaches cannot simultaneously satisfy these requirements. For example, building a separate model for each (conversion type, display scenario) pair is neither scalable nor convenient. Building a unified model trained on all the data with conversion type and display scenario included as two features is not accurate enough. In this paper, we propose the Masked Multi-domain Network (MMN) to solve this problem. To achieve the accuracy requirement, we model domain-specific parameters and propose a dynamically weighted loss to account for the loss scale imbalance issue within each mini-batch. To achieve the scalability requirement, we propose a parameter sharing and composition strategy to reduce model parameters from a product space to a sum space. To achieve the convenience requirement, we propose an auto-masking strategy which can take mixed data from all the domains as input. It avoids the overhead caused by data partitioning, individual processing and separate storage. Both offline and online experimental results validate the superiority of MMN for multi-type and multi-scenario CVR prediction. MMN is now the serving model for real-time CVR prediction in UC Toutiao.
comment: CIKM 2023 (larger figures)
☆ On permutation-invariant neural networks
Conventional machine learning algorithms have traditionally been designed under the assumption that input data follows a vector-based format, with an emphasis on vector-centric paradigms. However, as the demand for tasks involving set-based inputs has grown, there has been a paradigm shift in the research community towards addressing these challenges. In recent years, the emergence of neural network architectures such as Deep Sets and Transformers has presented a significant advancement in the treatment of set-based data. These architectures are specifically engineered to naturally accommodate sets as input, enabling more effective representation and processing of set structures. Consequently, there has been a surge of research endeavors dedicated to exploring and harnessing the capabilities of these architectures for various tasks involving the approximation of set functions. This comprehensive survey aims to provide an overview of the diverse problem settings and ongoing research efforts pertaining to neural networks that approximate set functions. By delving into the intricacies of these approaches and elucidating the associated challenges, the survey aims to equip readers with a comprehensive understanding of the field. Through this comprehensive perspective, we hope that researchers can gain valuable insights into the potential applications, inherent limitations, and future directions of set-based neural networks. Indeed, from this survey we gain two insights: i) Deep Sets and its variants can be generalized by differences in the aggregation function, and ii) the behavior of Deep Sets is sensitive to the choice of the aggregation function. From these observations, we show that Deep Sets, one of the well-known permutation-invariant neural networks, can be generalized in the sense of a quasi-arithmetic mean.
☆ Transcribing Bengali Text with Regional Dialects to IPA using District Guided Tokens
Accurate transcription of Bengali text to the International Phonetic Alphabet (IPA) is a challenging task due to the complex phonology of the language and context-dependent sound changes. This challenge is even more for regional Bengali dialects due to unavailability of standardized spelling conventions for these dialects, presence of local and foreign words popular in those regions and phonological diversity across different regions. This paper presents an approach to this sequence-to-sequence problem by introducing the District Guided Tokens (DGT) technique on a new dataset spanning six districts of Bangladesh. The key idea is to provide the model with explicit information about the regional dialect or "district" of the input text before generating the IPA transcription. This is achieved by prepending a district token to the input sequence, effectively guiding the model to understand the unique phonetic patterns associated with each district. The DGT technique is applied to fine-tune several transformer-based models, on this new dataset. Experimental results demonstrate the effectiveness of DGT, with the ByT5 model achieving superior performance over word-based models like mT5, BanglaT5, and umT5. This is attributed to ByT5's ability to handle a high percentage of out-of-vocabulary words in the test set. The proposed approach highlights the importance of incorporating regional dialect information into ubiquitous natural language processing systems for languages with diverse phonological variations. The following work was a result of the "Bhashamul" challenge, which is dedicated to solving the problem of Bengali text with regional dialects to IPA transcription https://www.kaggle.com/competitions/regipa/. The training and inference notebooks are available through the competition link.
comment: This work became the champion of the Bhashamul challenge
☆ Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study
Mixture-of-Experts (MoE) represents an ensemble methodology that amalgamates predictions from several specialized sub-models (referred to as experts). This fusion is accomplished through a router mechanism, dynamically assigning weights to each expert's contribution based on the input data. Conventional MoE mechanisms select all available experts, incurring substantial computational costs. In contrast, Sparse Mixture-of-Experts (Sparse MoE) selectively engages only a limited number, or even just one expert, significantly reducing computation overhead while empirically preserving, and sometimes even enhancing, performance. Despite its wide-ranging applications and these advantageous characteristics, MoE's theoretical underpinnings have remained elusive. In this paper, we embark on an exploration of Sparse MoE's generalization error concerning various critical factors. Specifically, we investigate the impact of the number of data samples, the total number of experts, the sparsity in expert selection, the complexity of the routing mechanism, and the complexity of individual experts. Our analysis sheds light on \textit{how \textbf{sparsity} contributes to the MoE's generalization}, offering insights from the perspective of classical learning theory.
☆ Application-Driven Innovation in Machine Learning
As applications of machine learning proliferate, innovative algorithms inspired by specific real-world challenges have become increasingly important. Such work offers the potential for significant impact not merely in domains of application but also in machine learning itself. In this paper, we describe the paradigm of application-driven research in machine learning, contrasting it with the more standard paradigm of methods-driven research. We illustrate the benefits of application-driven machine learning and how this approach can productively synergize with methods-driven work. Despite these benefits, we find that reviewing, hiring, and teaching practices in machine learning often hold back application-driven innovation. We outline how these processes may be improved.
comment: 12 pages, 3 figures
☆ Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.
comment: Project page is available at https://ku-cvlab.github.io/Perturbed-Attention-Guidance
☆ AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving CVPR-2024
Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
comment: Accepted by CVPR-2024
☆ A Moreau Envelope Approach for LQR Meta-Policy Estimation
We study the problem of policy estimation for the Linear Quadratic Regulator (LQR) in discrete-time linear time-invariant uncertain dynamical systems. We propose a Moreau Envelope-based surrogate LQR cost, built from a finite set of realizations of the uncertain system, to define a meta-policy efficiently adjustable to new realizations. Moreover, we design an algorithm to find an approximate first-order stationary point of the meta-LQR cost function. Numerical results show that the proposed approach outperforms naive averaging of controllers on new realizations of the linear system. We also provide empirical evidence that our method has better sample complexity than Model-Agnostic Meta-Learning (MAML) approaches.
comment: 8 pages
☆ Multi-Objective Trajectory Planning with Dual-Encoder
Time-jerk optimal trajectory planning is crucial in advancing robotic arms' performance in dynamic tasks. Traditional methods rely on solving complex nonlinear programming problems, bringing significant delays in generating optimized trajectories. In this paper, we propose a two-stage approach to accelerate time-jerk optimal trajectory planning. Firstly, we introduce a dual-encoder based transformer model to establish a good preliminary trajectory. This trajectory is subsequently refined through sequential quadratic programming to improve its optimality and robustness. Our approach outperforms the state-of-the-art by up to 79.72\% in reducing trajectory planning time. Compared with existing methods, our method shrinks the optimality gap with the objective function value decreasing by up to 29.9\%.
comment: 6 pages, 7 figures, conference
☆ Learn from Heterophily: Heterophilous Information-enhanced Graph Neural Network
Under circumstances of heterophily, where nodes with different labels tend to be connected based on semantic meanings, Graph Neural Networks (GNNs) often exhibit suboptimal performance. Current studies on graph heterophily mainly focus on aggregation calibration or neighbor extension and address the heterophily issue by utilizing node features or structural information to improve GNN representations. In this paper, we propose and demonstrate that the valuable semantic information inherent in heterophily can be utilized effectively in graph learning by investigating the distribution of neighbors for each individual node within the graph. The theoretical analysis is carried out to demonstrate the efficacy of the idea in enhancing graph learning. Based on this analysis, we propose HiGNN, an innovative approach that constructs an additional new graph structure, that integrates heterophilous information by leveraging node distribution to enhance connectivity between nodes that share similar semantic characteristics. We conduct empirical assessments on node classification tasks using both homophilous and heterophilous benchmark datasets and compare HiGNN to popular GNN baselines and SoTA methods, confirming the effectiveness in improving graph representations. In addition, by incorporating heterophilous information, we demonstrate a notable enhancement in existing GNN-based approaches, and the homophily degree across real-world datasets, thus affirming the efficacy of our approach.
☆ Language Models are Free Boosters for Biomedical Imaging Tasks
In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of language or textual data. The approach diverges from established methodologies by utilizing a frozen transformer block, extracted from pre-trained LLMs, as an innovative encoder layer for the direct processing of visual tokens. This strategy represents a significant departure from the standard multi-modal vision-language frameworks, which typically hinge on language-driven prompts and inputs. We found that these LLMs could boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks, serving as plug-and-play boosters. More interestingly, as a byproduct, we found that the proposed framework achieved superior performance, setting new state-of-the-art results on extensive, standardized datasets in MedMNIST-2D and 3D. Through this work, we aim to open new avenues for employing LLMs in biomedical imaging and enriching the understanding of their potential in this specialized domain.
☆ The Pursuit of Fairness in Artificial Intelligence Models: A Survey
Artificial Intelligence (AI) models are now being utilized in all facets of our lives such as healthcare, education and employment. Since they are used in numerous sensitive environments and make decisions that can be life altering, potential biased outcomes are a pressing matter. Developers should ensure that such models don't manifest any unexpected discriminatory practices like partiality for certain genders, ethnicities or disabled people. With the ubiquitous dissemination of AI systems, researchers and practitioners are becoming more aware of unfair models and are bound to mitigate bias in them. Significant research has been conducted in addressing such issues to ensure models don't intentionally or unintentionally perpetuate bias. This survey offers a synopsis of the different ways researchers have promoted fairness in AI systems. We explore the different definitions of fairness existing in the current literature. We create a comprehensive taxonomy by categorizing different types of bias and investigate cases of biased AI in different application domains. A thorough study is conducted of the approaches and techniques employed by researchers to mitigate bias in AI models. Moreover, we also delve into the impact of biased models on user experience and the ethical considerations to contemplate when developing and deploying such models. We hope this survey helps researchers and practitioners understand the intricate details of fairness and bias in AI systems. By sharing this thorough survey, we aim to promote additional discourse in the domain of equitable and responsible AI.
comment: 37 pages, 6 figures
☆ Deep Support Vectors
While the success of deep learning is commonly attributed to its theoretical equivalence with Support Vector Machines (SVM), the practical implications of this relationship have not been thoroughly explored. This paper pioneers an exploration in this domain, specifically focusing on the identification of Deep Support Vectors (DSVs) within deep learning models. We introduce the concept of DeepKKT conditions, an adaptation of the traditional Karush-Kuhn-Tucker (KKT) conditions tailored for deep learning. Through empirical investigations, we illustrate that DSVs exhibit similarities to support vectors in SVM, offering a tangible method to interpret the decision-making criteria of models. Additionally, our findings demonstrate that models can be effectively reconstructed using DSVs, resembling the process in SVM. The code will be available.
♻ ☆ LocalTweets to LocalHealth: A Mental Health Surveillance Framework Based on Twitter Data
Prior research on Twitter (now X) data has provided positive evidence of its utility in developing supplementary health surveillance systems. In this study, we present a new framework to surveil public health, focusing on mental health (MH) outcomes. We hypothesize that locally posted tweets are indicative of local MH outcomes and collect tweets posted from 765 neighborhoods (census block groups) in the USA. We pair these tweets from each neighborhood with the corresponding MH outcome reported by the Center for Disease Control (CDC) to create a benchmark dataset, LocalTweets. With LocalTweets, we present the first population-level evaluation task for Twitter-based MH surveillance systems. We then develop an efficient and effective method, LocalHealth, for predicting MH outcomes based on LocalTweets. When used with GPT3.5, LocalHealth achieves the highest F1-score and accuracy of 0.7429 and 79.78\%, respectively, a 59\% improvement in F1-score over the GPT3.5 in zero-shot setting. We also utilize LocalHealth to extrapolate CDC's estimates to proxy unreported neighborhoods, achieving an F1-score of 0.7291. Our work suggests that Twitter data can be effectively leveraged to simulate neighborhood-level MH outcomes.
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
♻ ☆ An optimal control perspective on diffusion-based generative modeling NeurIPS 2022
We establish a connection between stochastic optimal control and generative models based on stochastic differential equations (SDEs), such as recently developed diffusion probabilistic models. In particular, we derive a Hamilton-Jacobi-Bellman equation that governs the evolution of the log-densities of the underlying SDE marginals. This perspective allows to transfer methods from optimal control theory to generative modeling. First, we show that the evidence lower bound is a direct consequence of the well-known verification theorem from control theory. Further, we can formulate diffusion-based generative modeling as a minimization of the Kullback-Leibler divergence between suitable measures in path space. Finally, we develop a novel diffusion-based method for sampling from unnormalized densities -- a problem frequently occurring in statistics and computational sciences. We demonstrate that our time-reversed diffusion sampler (DIS) can outperform other diffusion-based sampling approaches on multiple numerical examples.
comment: Accepted for oral presentation at NeurIPS 2022 Workshop on Score-Based Methods
♻ ☆ Fully Independent Communication in Multi-Agent Reinforcement Learning AAMAS 2024
Multi-Agent Reinforcement Learning (MARL) comprises a broad area of research within the field of multi-agent systems. Several recent works have focused specifically on the study of communication approaches in MARL. While multiple communication methods have been proposed, these might still be too complex and not easily transferable to more practical contexts. One of the reasons for that is due to the use of the famous parameter sharing trick. In this paper, we investigate how independent learners in MARL that do not share parameters can communicate. We demonstrate that this setting might incur into some problems, to which we propose a new learning scheme as a solution. Our results show that, despite the challenges, independent agents can still learn communication strategies following our method. Additionally, we use this method to investigate how communication in MARL is affected by different network capacities, both for sharing and not sharing parameters. We observe that communication may not always be needed and that the chosen agent network sizes need to be considered when used together with communication in order to achieve efficient learning.
comment: Extended version of the paper appearing on AAMAS 2024 with the same title. 11 pages, 8 figures
♻ ☆ A randomized algorithm for nonconvex minimization with inexact evaluations and complexity guarantees
We consider minimization of a smooth nonconvex function with inexact oracle access to gradient and Hessian (without assuming access to the function value) to achieve approximate second-order optimality. A novel feature of our method is that if an approximate direction of negative curvature is chosen as the step, we choose its sense to be positive or negative with equal probability. We allow gradients to be inexact in a relative sense and relax the coupling between inexactness thresholds for the first- and second-order optimality conditions. Our convergence analysis includes both an expectation bound based on martingale analysis and a high-probability bound based on concentration inequalities. We apply our algorithm to empirical risk minimization problems and obtain improved gradient sample complexity over existing works.
♻ ☆ Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity
Multimodal machine learning with missing modalities is an increasingly relevant challenge arising in various applications such as healthcare. This paper extends the current research into missing modalities to the low-data regime, i.e., a downstream task has both missing modalities and limited sample size issues. This problem setting is particularly challenging and also practical as it is often expensive to get full-modality data and sufficient annotated training samples. We propose to use retrieval-augmented in-context learning to address these two crucial issues by unleashing the potential of a transformer's in-context learning ability. Diverging from existing methods, which primarily belong to the parametric paradigm and often require sufficient training samples, our work exploits the value of the available full-modality data, offering a novel perspective on resolving the challenge. The proposed data-dependent framework exhibits a higher degree of sample efficiency and is empirically demonstrated to enhance the classification model's performance on both full- and missing-modality data in the low-data regime across various multimodal learning tasks. When only 1% of the training data are available, our proposed method demonstrates an average improvement of 6.1% over a recent strong baseline across various datasets and missing states. Notably, our method also reduces the performance gap between full-modality and missing-modality data compared with the baseline.
♻ ☆ Probabilistically Rewired Message-Passing Neural Networks ICLR 2024
Message-passing graph neural networks (MPNNs) emerged as powerful tools for processing graph-structured input. However, they operate on a fixed input graph structure, ignoring potential noise and missing information. Furthermore, their local aggregation mechanism can lead to problems such as over-squashing and limited expressive power in capturing relevant graph structures. Existing solutions to these challenges have primarily relied on heuristic methods, often disregarding the underlying data distribution. Hence, devising principled approaches for learning to infer graph structures relevant to the given prediction task remains an open challenge. In this work, leveraging recent progress in exact and differentiable $k$-subset sampling, we devise probabilistically rewired MPNNs (PR-MPNNs), which learn to add relevant edges while omitting less beneficial ones. For the first time, our theoretical analysis explores how PR-MPNNs enhance expressive power, and we identify precise conditions under which they outperform purely randomized approaches. Empirically, we demonstrate that our approach effectively mitigates issues like over-squashing and under-reaching. In addition, on established real-world datasets, our method exhibits competitive or superior predictive performance compared to traditional MPNN models and recent graph transformer architectures.
comment: ICLR 2024
♻ ☆ Optimal Data Splitting in Distributed Optimization for Machine Learning
The distributed optimization problem has become increasingly relevant recently. It has a lot of advantages such as processing a large amount of data in less time compared to non-distributed methods. However, most distributed approaches suffer from a significant bottleneck - the cost of communications. Therefore, a large amount of research has recently been directed at solving this problem. One such approach uses local data similarity. In particular, there exists an algorithm provably optimally exploiting the similarity property. But this result, as well as results from other works solve the communication bottleneck by focusing only on the fact that communication is significantly more expensive than local computing and does not take into account the various capacities of network devices and the different relationship between communication time and local computing expenses. We consider this setup and the objective of this study is to achieve an optimal ratio of distributed data between the server and local machines for any costs of communications and local computations. The running times of the network are compared between uniform and optimal distributions. The superior theoretical performance of our solutions is experimentally validated.
comment: 17 pages, 2 figures
♻ ☆ Dynamics of Moral Behavior in Heterogeneous Populations of Learning Agents
Growing concerns about safety and alignment of AI systems highlight the importance of embedding moral capabilities in artificial agents. A promising solution is the use of learning from experience, i.e., Reinforcement Learning. In multi-agent (social) environments, complex population-level phenomena may emerge from interactions between individual learning agents. Many of the existing studies rely on simulated social dilemma environments to study the interactions of independent learning agents. However, they tend to ignore the moral heterogeneity that is likely to be present in societies of agents in practice. For example, at different points in time a single learning agent may face opponents who are consequentialist (i.e., caring about maximizing some outcome over time) or norm-based (i.e., focusing on conforming to a specific norm here and now). The extent to which agents' co-development may be impacted by such moral heterogeneity in populations is not well understood. In this paper, we present a study of the learning dynamics of morally heterogeneous populations interacting in a social dilemma setting. Using a Prisoner's Dilemma environment with a partner selection mechanism, we investigate the extent to which the prevalence of diverse moral agents in populations affects individual agents' learning behaviors and emergent population-level outcomes. We observe several types of non-trivial interactions between pro-social and anti-social agents, and find that certain classes of moral agents are able to steer selfish agents towards more cooperative behavior.
♻ ☆ Room Transfer Function Reconstruction Using Complex-valued Neural Networks and Irregularly Distributed Microphones
Reconstructing the room transfer functions needed to calculate the complex sound field in a room has several impor- tant real-world applications. However, an unpractical number of microphones is often required. Recently, in addition to classical signal processing methods, deep learning techniques have been applied to reconstruct the room transfer function starting from a very limited set of measurements at scattered points in the room. In this paper, we employ complex-valued neural networks to estimate room transfer functions in the frequency range of the first room resonances, using a few irregularly distributed microphones. To the best of our knowledge, this is the first time that complex-valued neural networks are used to estimate room transfer functions. To analyze the benefits of applying complex- valued optimization to the considered task, we compare the proposed technique with a state-of-the-art kernel-based signal processing approach for sound field reconstruction, showing that the proposed technique exhibits relevant advantages in terms of phase accuracy and overall quality of the reconstructed sound field. For informative purposes, we also compare the model with a similarly-structured data-driven approach that, however, applies a real-valued neural network to reconstruct only the magnitude of the sound field.
comment: Submitted to EUSIPCO 2024
♻ ☆ Activations and Gradients Compression for Model-Parallel Training
Large neural networks require enormous computational clusters of machines. Model-parallel training, when the model architecture is partitioned sequentially between workers, is a popular approach for training modern models. Information compression can be applied to decrease workers communication time, as it is often a bottleneck in such systems. This work explores how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and also experiment with error compensation techniques. Moreover, we employ TopK with AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression rates than activations. We observe that $K=10\%$ is the lowest TopK compression level, which does not harm model convergence severely. Experiments also show that models trained with TopK perform well only when compression is also applied during inference. We find that error feedback techniques do not improve model-parallel training compared to plain compression, but allow model inference without compression with almost no quality drop. Finally, when applied with the AQ-SGD approach, TopK stronger than with $ K=30\%$ worsens model performance significantly.
comment: 17 pages, 6 figures, 5 tables
♻ ☆ Differentially private multivariate medians
Statistical tools which satisfy rigorous privacy guarantees are necessary for modern data analysis. It is well-known that robustness against contamination is linked to differential privacy. Despite this fact, using multivariate medians for differentially private and robust multivariate location estimation has not been systematically studied. We develop novel finite-sample performance guarantees for differentially private multivariate depth-based medians, which are essentially sharp. Our results cover commonly used depth functions, such as the halfspace (or Tukey) depth, spatial depth, and the integrated dual depth. We show that under Cauchy marginals, the cost of heavy-tailed location estimation outweighs the cost of privacy. We demonstrate our results numerically using a Gaussian contamination model in dimensions up to d = 100, and compare them to a state-of-the-art private mean estimation algorithm. As a by-product of our investigation, we prove concentration inequalities for the output of the exponential mechanism about the maximizer of the population objective function. This bound applies to objective functions that satisfy a mild regularity condition.
comment: 42 pages, 3 figures, 2 tables
♻ ☆ AI and Generative AI for Research Discovery and Summarization
AI and generative AI tools, including chatbots like ChatGPT that rely on large language models (LLMs), have burst onto the scene this year, creating incredible opportunities to increase work productivity and improve our lives. Statisticians and data scientists have begun experiencing the benefits from the availability of these tools in numerous ways, such as the generation of programming code from text prompts to analyze data or fit statistical models. One area that these tools can make a substantial impact is in research discovery and summarization. Standalone tools and plugins to chatbots are being developed that allow researchers to more quickly find relevant literature than pre-2023 search tools. Furthermore, generative AI tools have improved to the point where they can summarize and extract the key points from research articles in succinct language. Finally, chatbots based on highly parameterized LLMs can be used to simulate abductive reasoning, which provides researchers the ability to make connections among related technical topics, which can also be used for research discovery. We review the developments in AI and generative AI for research discovery and summarization, and propose directions where these types of tools are likely to head in the future that may be of interest to statistician and data scientists.
comment: 29 pages, 9 figures
♻ ☆ PINN surrogate of Li-ion battery models for parameter inference. Part II: Regularization and application of the pseudo-2D model
Bayesian parameter inference is useful to improve Li-ion battery diagnostics and can help formulate battery aging models. However, it is computationally intensive and cannot be easily repeated for multiple cycles, multiple operating conditions, or multiple replicate cells. To reduce the computational cost of Bayesian calibration, numerical solvers for physics-based models can be replaced with faster surrogates. A physics-informed neural network (PINN) is developed as a surrogate for the pseudo-2D (P2D) battery model calibration. For the P2D surrogate, additional training regularization was needed as compared to the PINN single-particle model (SPM) developed in Part I. Both the PINN SPM and P2D surrogate models are exercised for parameter inference and compared to data obtained from a direct numerical solution of the governing equations. A parameter inference study highlights the ability to use these PINNs to calibrate scaling parameters for the cathode Li diffusion and the anode exchange current density. By realizing computational speed-ups of 2250x for the P2D model, as compared to using standard integrating methods, the PINN surrogates enable rapid state-of-health diagnostics. In the low-data availability scenario, the testing error was estimated to 2mV for the SPM surrogate and 10mV for the P2D surrogate which could be mitigated with additional data.
♻ ☆ ChatGPT Needs SPADE (Sustainability, PrivAcy, Digital divide, and Ethics) Evaluation: A Review
ChatGPT is another large language model (LLM) vastly available for the consumers on their devices but due to its performance and ability to converse effectively, it has gained a huge popularity amongst research as well as industrial community. Recently, many studies have been published to show the effectiveness, efficiency, integration, and sentiments of chatGPT and other LLMs. In contrast, this study focuses on the important aspects that are mostly overlooked, i.e. sustainability, privacy, digital divide, and ethics and suggests that not only chatGPT but every subsequent entry in the category of conversational bots should undergo Sustainability, PrivAcy, Digital divide, and Ethics (SPADE) evaluation. This paper discusses in detail the issues and concerns raised over chatGPT in line with aforementioned characteristics. We also discuss the recent EU AI Act briefly in accordance with the SPADE evaluation. We support our hypothesis by some preliminary data collection and visualizations along with hypothesized facts. We also suggest mitigations and recommendations for each of the concerns. Furthermore, we also suggest some policies and recommendations for AI policy act, if designed by the governments.
comment: 29 pages, 8 figures, 4 tables
♻ ☆ PINN surrogate of Li-ion battery models for parameter inference. Part I: Implementation and multi-fidelity hierarchies for the single-particle model
To plan and optimize energy storage demands that account for Li-ion battery aging dynamics, techniques need to be developed to diagnose battery internal states accurately and rapidly. This study seeks to reduce the computational resources needed to determine a battery's internal states by replacing physics-based Li-ion battery models -- such as the single-particle model (SPM) and the pseudo-2D (P2D) model -- with a physics-informed neural network (PINN) surrogate. The surrogate model makes high-throughput techniques, such as Bayesian calibration, tractable to determine battery internal parameters from voltage responses. This manuscript is the first of a two-part series that introduces PINN surrogates of Li-ion battery models for parameter inference (i.e., state-of-health diagnostics). In this first part, a method is presented for constructing a PINN surrogate of the SPM. A multi-fidelity hierarchical training, where several neural nets are trained with multiple physics-loss fidelities is shown to significantly improve the surrogate accuracy when only training on the governing equation residuals. The implementation is made available in a companion repository (https://github.com/NREL/pinnstripes). The techniques used to develop a PINN surrogate of the SPM are extended in Part II for the PINN surrogate for the P2D battery model, and explore the Bayesian calibration capabilities of both surrogates.
♻ ☆ Disentangling the Spectral Properties of the Hodge Laplacian: Not All Small Eigenvalues Are Equal
The rich spectral information of the graph Laplacian has been instrumental in graph theory, machine learning, and graph signal processing for applications such as graph classification, clustering, or eigenmode analysis. Recently, the Hodge Laplacian has come into focus as a generalisation of the ordinary Laplacian for higher-order graph models such as simplicial and cellular complexes. Akin to the traditional analysis of graph Laplacians, many authors analyse the smallest eigenvalues of the Hodge Laplacian, which are connected to important topological properties such as homology. However, small eigenvalues of the Hodge Laplacian can carry different information depending on whether they are related to curl or gradient eigenmodes, and thus may not be comparable. We therefore introduce the notion of persistent eigenvector similarity and provide a method to track individual harmonic, curl, and gradient eigenvectors/-values through the so-called persistence filtration, leveraging the full information contained in the Hodge-Laplacian spectrum across all possible scales of a point cloud. Finally, we use our insights (a) to introduce a novel form of Hodge spectral clustering and (b) to classify edges and higher-order simplices based on their relationship to the smallest harmonic, curl, and gradient eigenvectors.
comment: 5 pages, 4 figures, comments welcome
♻ ☆ Semi-Supervised Crowd Counting from Unlabeled Data
Automatic Crowd behavior analysis can be applied to effectively help the daily transportation statistics and planning, which helps the smart city construction. As one of the most important keys, crowd counting has drawn increasing attention. Recent works achieved promising performance but relied on the supervised paradigm with expensive crowd annotations. To alleviate the annotation cost in real-world transportation scenarios, in this work we proposed a semi-supervised learning framework $S^{4}\textit{Crowd}$, which can leverage both unlabeled/labeled data for robust crowd counting. In the unsupervised pathway, two \textit{self-supervised losses} were proposed to simulate the crowd variations such as scale, illumination, based on which supervised information pseudo labels were generated and gradually refined. We also proposed a crowd-driven recurrent unit \textit{Gated-Crowd-Recurrent-Unit (GCRU)}, which can preserve discriminant crowd information by extracting second-order statistics, yielding pseudo labels with improved quality. A joint loss including both unsupervised/supervised information was proposed, and a dynamic weighting strategy was employed to balance the importance of the unsupervised loss and supervised loss at different training stages. We conducted extensive experiments on four popular crowd counting datasets in semi-supervised settings. Experimental results supported the effectiveness of each proposed component in our $S^{4}$Crowd framework. Our method achieved competitive performance in semi-supervised learning approaches on these crowd counting datasets.
♻ ☆ Efficient Pre-training for Localized Instruction Generation of Videos
Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than current web-scale datasets, enables efficient training of large-scale models with competitive performance. We complement our Sieve-\&-Swap approach with a Procedure Transformer (ProcX) for end-to-end step localization and instruction generation for procedural videos. When this model is pre-trained on our curated dataset, it achieves state-of-the-art performance in zero-shot and finetuning settings on YouCook2 and Tasty, while using a fraction of the computational resources.
comment: This version has some missing experiments and elaborative technical details
♻ ☆ Toward a Theory of Causation for Interpreting Neural Code Models
Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post hoc interpretability method specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and ten NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax. All our NCMs, except for the BERT-like model, statistically learn to predict tokens related to blocks of code (\eg brackets, parenthesis, semicolon) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful method to detect and facilitate the elimination of confounding bias in NCMs.
comment: Accepted to appear in IEEE Transactions on Software Engineering
♻ ☆ RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$
Meta reinforcement learning (meta-RL) methods such as RL$^2$ have emerged as promising approaches for learning data-efficient RL algorithms tailored to a given task distribution. However, they show poor asymptotic performance and struggle with out-of-distribution tasks because they rely on sequence models, such as recurrent neural networks or transformers, to process experiences rather than summarize them using general-purpose RL components such as value functions. In contrast, traditional RL algorithms are data-inefficient as they do not use domain knowledge, but they do converge to an optimal policy in the limit. We propose RL$^3$, a principled hybrid approach that incorporates action-values, learned per task through traditional RL, in the inputs to meta-RL. We show that RL$^3$ earns greater cumulative reward in the long term, compared to RL$^2$, while maintaining data-efficiency in the short term, and generalizes better to out-of-distribution tasks. Experiments are conducted on both custom and benchmark discrete domains from the meta-RL literature that exhibit a range of short-term, long-term, and complex dependencies.
♻ ☆ Multi-Objective Optimization for Sparse Deep Multi-Task Learning
Different conflicting optimization criteria arise naturally in various Deep Learning scenarios. These can address different main tasks (i.e., in the setting of Multi-Task Learning), but also main and secondary tasks such as loss minimization versus sparsity. The usual approach is a simple weighting of the criteria, which formally only works in the convex setting. In this paper, we present a Multi-Objective Optimization algorithm using a modified Weighted Chebyshev scalarization for training Deep Neural Networks (DNNs) with respect to several tasks. By employing this scalarization technique, the algorithm can identify all optimal solutions of the original problem while reducing its complexity to a sequence of single-objective problems. The simplified problems are then solved using an Augmented Lagrangian method, enabling the use of popular optimization techniques such as Adam and Stochastic Gradient Descent, while efficaciously handling constraints. Our work aims to address the (economical and also ecological) sustainability issue of DNN models, with a particular focus on Deep Multi-Task models, which are typically designed with a very large number of weights to perform equally well on multiple tasks. Through experiments conducted on two Machine Learning datasets, we demonstrate the possibility of adaptively sparsifying the model during training without significantly impacting its performance, if we are willing to apply task-specific adaptations to the network weights. Code is available at https://github.com/salomonhotegni/MDMTN
comment: 12 pages, 7 figures
♻ ☆ Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning
Leveraging generative Artificial Intelligence (AI), we have transformed a dataset comprising 1,000 scientific papers into an ontological knowledge graph. Through an in-depth structural analysis, we have calculated node degrees, identified communities and connectivities, and evaluated clustering coefficients and betweenness centrality of pivotal nodes, uncovering fascinating knowledge architectures. The graph has an inherently scale-free nature, is highly connected, and can be used for graph reasoning by taking advantage of transitive and isomorphic properties that reveal unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, propose never-before-seen material designs, and predict material behaviors. We compute deep node embeddings for combinatorial node similarity ranking for use in a path sampling strategy links dissimilar concepts that have previously not been related. One comparison revealed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. In another example, the algorithm proposed a hierarchical mycelium-based composite based on integrating path sampling with principles extracted from Kandinsky's 'Composition VII' painting. The resulting material integrates an innovative set of concepts that include a balance of chaos/order, adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across science, technology and art, revealing a nuanced ontology of immanence that reveal a context-dependent heterarchical interplay of constituents. Graph-based generative AI achieves a far higher degree of novelty, explorative capacity, and technical detail, than conventional approaches and establishes a widely useful framework for innovation by revealing hidden connections.
♻ ☆ Unveiling the Pitfalls of Knowledge Editing for Large Language Models ICLR 2024
As the cost associated with fine-tuning Large Language Models (LLMs) continues to rise, recent research efforts have pivoted towards developing methodologies to edit implicit knowledge embedded within LLMs. Yet, there's still a dark cloud lingering overhead -- will knowledge editing trigger butterfly effect? since it is still unclear whether knowledge editing might introduce side effects that pose potential risks or not. This paper pioneers the investigation into the potential pitfalls associated with knowledge editing for LLMs. To achieve this, we introduce new benchmark datasets and propose innovative evaluation metrics. Our results underline two pivotal concerns: (1) Knowledge Conflict: Editing groups of facts that logically clash can magnify the inherent inconsistencies in LLMs-a facet neglected by previous methods. (2) Knowledge Distortion: Altering parameters with the aim of editing factual knowledge can irrevocably warp the innate knowledge structure of LLMs. Experimental results vividly demonstrate that knowledge editing might inadvertently cast a shadow of unintended consequences on LLMs, which warrant attention and efforts for future works. Code and data are available at https://github.com/zjunlp/PitfallsKnowledgeEditing.
comment: ICLR 2024
♻ ☆ Artificial Neural Nets and the Representation of Human Concepts
What do artificial neural networks (ANNs) learn? The machine learning (ML) community shares the narrative that ANNs must develop abstract human concepts to perform complex tasks. Some go even further and believe that these concepts are stored in individual units of the network. Based on current research, I systematically investigate the assumptions underlying this narrative. I conclude that ANNs are indeed capable of performing complex prediction tasks, and that they may learn human and non-human concepts to do so. However, evidence indicates that ANNs do not represent these concepts in individual units.
comment: For: Philosophy of Science for Machine Learning: Core Issues and New Perspectives, edited by Juan Duran and Giorgia Pozzi
♻ ☆ Dual Conic Proxies for AC Optimal Power Flow SC
In recent years, there has been significant interest in the development of machine learning-based optimization proxies for AC Optimal Power Flow (AC-OPF). Although significant progress has been achieved in predicting high-quality primal solutions, no existing learning-based approach can provide valid dual bounds for AC-OPF. This paper addresses this gap by training optimization proxies for a convex relaxation of AC-OPF. Namely, the paper considers a second-order cone (SOC) relaxation of AC-OPF, and proposes \revision{a novel architecture} that embeds a fast, differentiable (dual) feasibility recovery, thus providing valid dual bounds. The paper combines this new architecture with a self-supervised learning scheme, which alleviates the need for costly training data generation. Extensive numerical experiments on medium- and large-scale power grids demonstrate the efficiency and scalability of the proposed methodology.
comment: accepted to PSCC 2024
♻ ☆ Harmonic Control Lyapunov Barrier Functions for Constrained Optimal Control with Reach-Avoid Specifications
This paper introduces harmonic control Lyapunov barrier functions (harmonic CLBF) that aid in constrained control problems such as reach-avoid problems. Harmonic CLBFs exploit the maximum principle that harmonic functions satisfy to encode the properties of control Lyapunov barrier functions (CLBFs). As a result, they can be initiated at the start of an experiment rather than trained based on sample trajectories. The control inputs are selected to maximize the inner product of the system dynamics with the steepest descent direction of the harmonic CLBF. Numerical results are presented with four different systems under different reach-avoid environments. Harmonic CLBFs show a significantly low risk of entering unsafe regions and a high probability of entering the goal region.
♻ ☆ Investigating Feature and Model Importance in Android Malware Detection: An Implemented Survey and Experimental Comparison of ML-Based Methods
The popularity of Android means it is a common target for malware. Over the years, various studies have found that machine learning models can effectively discriminate malware from benign applications. However, as the operating system evolves, so does malware, bringing into question the findings of these previous studies, many of which report very high accuracies using small, outdated, and often imbalanced datasets. In this paper, we reimplement 18 representative past works and reevaluate them using a balanced, relevant, and up-to-date dataset comprising 124,000 applications. We also carry out new experiments designed to fill holes in existing knowledge, and use our findings to identify the most effective features and models to use for Android malware detection within a contemporary environment. We show that high detection accuracies (up to 96.8%) can be achieved using features extracted through static analysis alone, yielding a modest benefit (1%) from using far more expensive dynamic analysis. API calls and opcodes are the most productive static and TCP network traffic provide the most predictive dynamic features. Random forests are generally the most effective model, outperforming more complex deep learning approaches. Whilst directly combining static and dynamic features is generally ineffective, ensembling models separately leads to performances comparable to the best models but using less brittle features.
♻ ☆ In Search of a Data Transformation That Accelerates Neural Field Training CVPR 2024
Neural field is an emerging paradigm in data representation that trains a neural network to approximate the given signal. A key obstacle that prevents its widespread adoption is the encoding speed-generating neural fields requires an overfitting of a neural network, which can take a significant number of SGD steps to reach the desired fidelity level. In this paper, we delve into the impacts of data transformations on the speed of neural field training, specifically focusing on how permuting pixel locations affect the convergence speed of SGD. Counterintuitively, we find that randomly permuting the pixel locations can considerably accelerate the training. To explain this phenomenon, we examine the neural field training through the lens of PSNR curves, loss landscapes, and error patterns. Our analyses suggest that the random pixel permutations remove the easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder capturing fine details of the signal.
comment: CVPR 2024
♻ ☆ Coarse-Tuning for Ad-hoc Document Retrieval Using Pre-trained Language Models LREC
Fine-tuning in information retrieval systems using pre-trained language models (PLM-based IR) requires learning query representations and query-document relations, in addition to downstream task-specific learning. This study introduces coarse-tuning as an intermediate learning stage that bridges pre-training and fine-tuning. By learning query representations and query-document relations in coarse-tuning, we aim to reduce the load of fine-tuning and improve the learning effect of downstream IR tasks. We propose Query-Document Pair Prediction (QDPP) for coarse-tuning, which predicts the appropriateness of query-document pairs. Evaluation experiments show that the proposed method significantly improves MRR and/or nDCG@5 in four ad-hoc document retrieval datasets. Furthermore, the results of the query prediction task suggested that coarse-tuning facilitated learning of query representation and query-document relations.
comment: Accepted at LREC-COLING 2024
♻ ☆ Domain Randomization via Entropy Maximization ICLR 2024
Varying dynamics parameters in simulation is a popular Domain Randomization (DR) approach for overcoming the reality gap in Reinforcement Learning (RL). Nevertheless, DR heavily hinges on the choice of the sampling distribution of the dynamics parameters, since high variability is crucial to regularize the agent's behavior but notoriously leads to overly conservative policies when randomizing excessively. In this paper, we propose a novel approach to address sim-to-real transfer, which automatically shapes dynamics distributions during training in simulation without requiring real-world data. We introduce DOmain RAndomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution while retaining generalization capabilities. In achieving this, DORAEMON gradually increases the diversity of sampled dynamics parameters as long as the probability of success of the current policy is sufficiently high. We empirically validate the consistent benefits of DORAEMON in obtaining highly adaptive and generalizable policies, i.e. solving the task at hand across the widest range of dynamics parameters, as opposed to representative baselines from the DR literature. Notably, we also demonstrate the Sim2Real applicability of DORAEMON through its successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters.
comment: Published as a conference paper at ICLR 2024. Project website at https://gabrieletiboni.github.io/doraemon/
♻ ☆ Byzantine-resilient Federated Learning With Adaptivity to Data Heterogeneity
This paper deals with federated learning (FL) in the presence of malicious Byzantine attacks and data heterogeneity. A novel Robust Average Gradient Algorithm (RAGA) is proposed, which leverages the geometric median for aggregation and can freely select the round number for local updating. Different from most existing resilient approaches, which perform convergence analysis based on strongly-convex loss function or homogeneously distributed dataset, we conduct convergence analysis for not only strongly-convex but also non-convex loss function over heterogeneous dataset. According to our theoretical analysis, as long as the fraction of dataset from malicious users is less than half, RAGA can achieve convergence at rate $\mathcal{O}({1}/{T^{2/3- \delta}})$ where $T$ is the iteration number and $\delta \in (0, 2/3)$ for non-convex loss function, and at linear rate for strongly-convex loss function. Moreover, stationary point or global optimal solution is proved to obtainable as data heterogeneity vanishes. Experimental results corroborate the robustness of RAGA to Byzantine attacks and verifies the advantage of RAGA over baselines on convergence performance under various intensity of Byzantine attacks, for heterogeneous dataset.
♻ ☆ Discretized Distributed Optimization over Dynamic Digraphs
We consider a discrete-time model of continuous-time distributed optimization over dynamic directed-graphs (digraphs) with applications to distributed learning. Our optimization algorithm works over general strongly connected dynamic networks under switching topologies, e.g., in mobile multi-agent systems and volatile networks due to link failures. Compared to many existing lines of work, there is no need for bi-stochastic weight designs on the links. The existing literature mostly needs the link weights to be stochastic using specific weight-design algorithms needed both at the initialization and at all times when the topology of the network changes. This paper eliminates the need for such algorithms and paves the way for distributed optimization over time-varying digraphs. We derive the bound on the gradient-tracking step-size and discrete time-step for convergence and prove dynamic stability using arguments from consensus algorithms, matrix perturbation theory, and Lyapunov theory. This work, particularly, is an improvement over existing stochastic-weight undirected networks in case of link removal or packet drops. This is because the existing literature may need to rerun time-consuming and computationally complex algorithms for stochastic design, while the proposed strategy works as long as the underlying network is weight-symmetric and balanced. The proposed optimization framework finds applications to distributed classification and learning.
♻ ☆ COPR: Continual Learning Human Preference through Optimal Policy Regularization
The technique of Reinforcement Learning from Human Feedback (RLHF) is a commonly employed method to improve pre-trained Language Models (LM), enhancing their ability to conform to human preferences. Nevertheless, the current RLHF-based LMs necessitate full retraining each time novel queries or feedback are introduced, which becomes a challenging task because human preferences can vary between different domains or tasks. Retraining LMs poses practical difficulties in many real-world situations due to the significant time and computational resources required, along with concerns related to data privacy. To address this limitation, we propose a new method called Continual Optimal Policy Regularization (COPR), in which we compute the distribution of optimal policy bypassing the partition function and then regularize the current policy based on the historically optimal distribution to mitigate Catastrophic Forgetting (CF). COPR involves a single learning phase and doesn't necessitate complex reinforcement learning. Importantly, it shares the capability with RLHF to learn from unlabeled data by maintaining a scoring module, similar to reward model, making it flexible for continually learning without human feedback. Our experimental results show that COPR outperforms strong Continuous Learning (CL) baselines when it comes to consistently aligning with human preferences on incremental tasks and domains.
♻ ☆ SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching CVPR 2024
In this paper, we address the challenge of matching semantically similar keypoints across image pairs. Existing research indicates that the intermediate output of the UNet within the Stable Diffusion (SD) can serve as robust image feature maps for such a matching task. We demonstrate that by employing a basic prompt tuning technique, the inherent potential of Stable Diffusion can be harnessed, resulting in a significant enhancement in accuracy over previous approaches. We further introduce a novel conditional prompting module that conditions the prompt on the local details of the input image pairs, leading to a further improvement in performance. We designate our approach as SD4Match, short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets new benchmarks in accuracy across all these datasets. Particularly, SD4Match outperforms the previous state-of-the-art by a margin of 12 percentage points on the challenging SPair-71k dataset.
comment: Accepted to CVPR 2024. Project website: https://sd4match.active.vision/
♻ ☆ Stable Linear Subspace Identification: A Machine Learning Approach
Machine Learning (ML) and linear System Identification (SI) have been historically developed independently. In this paper, we leverage well-established ML tools - especially the automatic differentiation framework - to introduce SIMBa, a family of discrete linear multi-step-ahead state-space SI methods using backpropagation. SIMBa relies on a novel Linear-Matrix-Inequality-based free parametrization of Schur matrices to ensure the stability of the identified model. We show how SIMBa generally outperforms traditional linear state-space SI methods, and sometimes significantly, although at the price of a higher computational burden. This performance gap is particularly remarkable compared to other SI methods with stability guarantees, where the gain is frequently above 25% in our investigations, hinting at SIMBa's ability to simultaneously achieve state-of-the-art fitting performance and enforce stability. Interestingly, these observations hold for a wide variety of input-output systems and on both simulated and real-world data, showcasing the flexibility of the proposed approach. We postulate that this new SI paradigm presents a great extension potential to identify structured nonlinear models from data, and we hence open-source SIMBa on https://github.com/Cemempamoi/simba.
comment: Accepted at ECC 2024
♻ ☆ DeepMachining: Online Prediction of Machining Errors of Lathe Machines
We describe DeepMachining, a deep learning-based AI system for online prediction of machining errors of lathe machine operations. We have built and evaluated DeepMachining based on manufacturing data from factories. Specifically, we first pretrain a deep learning model for a given lathe machine's operations to learn the salient features of machining states. Then, we fine-tune the pretrained model to adapt to specific machining tasks. We demonstrate that DeepMachining achieves high prediction accuracy for multiple tasks that involve different workpieces and cutting tools. To the best of our knowledge, this work is one of the first factory experiments using pre-trained deep-learning models to predict machining errors of lathe machines.
♻ ☆ RetroBridge: Modeling Retrosynthesis with Markov Bridges
Retrosynthesis planning is a fundamental challenge in chemistry which aims at designing reaction pathways from commercially available starting materials to a target molecule. Each step in multi-step retrosynthesis planning requires accurate prediction of possible precursor molecules given the target molecule and confidence estimates to guide heuristic search algorithms. We model single-step retrosynthesis planning as a distribution learning problem in a discrete state space. First, we introduce the Markov Bridge Model, a generative framework aimed to approximate the dependency between two intractable discrete distributions accessible via a finite sample of coupled data points. Our framework is based on the concept of a Markov bridge, a Markov process pinned at its endpoints. Unlike diffusion-based methods, our Markov Bridge Model does not need a tractable noise distribution as a sampling proxy and directly operates on the input product molecules as samples from the intractable prior distribution. We then address the retrosynthesis planning problem with our novel framework and introduce RetroBridge, a template-free retrosynthesis modeling approach that achieves state-of-the-art results on standard evaluation benchmarks.
♻ ☆ ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation
System Verilog Assertion (SVA) formulation- a critical yet complex task is a prerequisite in the Formal Property Verification (FPV) process. Traditionally, SVA formulation involves expert-driven interpretation of specifications, which is timeconsuming and prone to human error. However, LLM-informed automatic assertion generation is gaining interest. We designeda novel framework called ChIRAAG, based on OpenAI GPT4, to generate SVA assertions from natural language specifications. ChIRAAG constitutes the systematic breakdown of design specifications into a standardized format, further generating assertions from formatted specifications using LLM. Furthermore, we developed testbenches to verify/validate the LLM-generated assertions. Automatic feedback of log files from the simulation tool to the LLM ensures that the framework can generate correc SVAs automatically. Only 33% of LLM-generated raw assertions had errors. Our results on OpenTitan designs shows that LLMs can streamline and assist engineers in the assertion generation process, reshaping verification workflows.
comment: 6 pages, 5 figures and 2 table
♻ ☆ Towards Low-Energy Adaptive Personalization for Resource-Constrained Devices
The personalization of machine learning (ML) models to address data drift is a significant challenge in the context of Internet of Things (IoT) applications. Presently, most approaches focus on fine-tuning either the full base model or its last few layers to adapt to new data, while often neglecting energy costs. However, various types of data drift exist, and fine-tuning the full base model or the last few layers may not result in optimal performance in certain scenarios. We propose Target Block Fine-Tuning (TBFT), a low-energy adaptive personalization framework designed for resource-constrained devices. We categorize data drift and personalization into three types: input-level, feature-level, and output-level. For each type, we fine-tune different blocks of the model to achieve optimal performance with reduced energy costs. Specifically, input-, feature-, and output-level correspond to fine-tuning the front, middle, and rear blocks of the model. We evaluate TBFT on a ResNet model, three datasets, three different training sizes, and a Raspberry Pi. Compared with the $Block Avg$, where each block is fine-tuned individually and their performance improvements are averaged, TBFT exhibits an improvement in model accuracy by an average of 15.30% whilst saving 41.57% energy consumption on average compared with full fine-tuning.
comment: Accepetd to The 4th Workshop on Machine Learning and Systems (EuroMLSys '24)
♻ ☆ FedCSD: A Federated Learning Based Approach for Code-Smell Detection
This paper proposes a Federated Learning Code Smell Detection (FedCSD) approach that allows organizations to collaboratively train federated ML models while preserving their data privacy. These assertions have been supported by three experiments that have significantly leveraged three manually validated datasets aimed at detecting and examining different code smell scenarios. In experiment 1, which was concerned with a centralized training experiment, dataset two achieved the lowest accuracy (92.30%) with fewer smells, while datasets one and three achieved the highest accuracy with a slight difference (98.90% and 99.5%, respectively). This was followed by experiment 2, which was concerned with cross-evaluation, where each ML model was trained using one dataset, which was then evaluated over the other two datasets. Results from this experiment show a significant drop in the model's accuracy (lowest accuracy: 63.80\%) where fewer smells exist in the training dataset, which has a noticeable reflection (technical debt) on the model's performance. Finally, the last and third experiments evaluate our approach by splitting the dataset into 10 companies. The ML model was trained on the company's site, then all model-updated weights were transferred to the server. Ultimately, an accuracy of 98.34% was achieved by the global model that has been trained using 10 companies for 100 training rounds. The results reveal a slight difference in the global model's accuracy compared to the highest accuracy of the centralized model, which can be ignored in favour of the global model's comprehensive knowledge, lower training cost, preservation of data privacy, and avoidance of the technical debt problem.
comment: 17 pages, 7 figures, Journal paper
♻ ☆ A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals in an LLM. Using multimodal information yields relative equal-error-rate improvements over text-only and audio-only models of up to 39% and 61%. Increasing the size of the LLM and training with low-rank adaption leads to further relative EER reductions of up to 18% on our dataset.
comment: arXiv admin note: text overlap with arXiv:2312.03632
♻ ☆ FedCau: A Proactive Stop Policy for Communication and Computation Efficient Federated Learning
This paper investigates efficient distributed training of a Federated Learning~(FL) model over a wireless network of wireless devices. The communication iterations of the distributed training algorithm may be substantially deteriorated or even blocked by the effects of the devices' background traffic, packet losses, congestion, or latency. We abstract the communication-computation impacts as an `iteration cost' and propose a cost-aware causal FL algorithm~(FedCau) to tackle this problem. We propose an iteration-termination method that trade-offs the training performance and networking costs. We apply our approach when clients use the slotted-ALOHA, the carrier-sense multiple access with collision avoidance~(CSMA/CA), and the orthogonal frequency-division multiple access~(OFDMA) protocols. We show that, given a total cost budget, the training performance degrades as either the background communication traffic or the dimension of the training problem increases. Our results demonstrate the importance of proactively designing optimal cost-efficient stopping criteria to avoid unnecessary communication-computation costs to achieve only a marginal FL training improvement. We validate our method by training and testing FL over the MNIST dataset. Finally, we apply our approach to existing communication efficient FL methods from the literature, achieving further efficiency. We conclude that cost-efficient stopping criteria are essential for the success of practical FL over wireless networks.
♻ ☆ DreamComposer: Controllable 3D Object Generation via Multi-View Conditions
Utilizing pre-trained 2D large-scale generative models, recent works are capable of generating high-quality novel views from a single in-the-wild image. However, due to the lack of information from multiple views, these works encounter difficulties in generating controllable novel views. In this paper, we present DreamComposer, a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then, it renders the latent features of the target view from 3D representations with the multi-view feature fusion module. Finally the target view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis, further enhancing them to generate high-fidelity novel view images with multi-view conditions, ready for controllable 3D object reconstruction and various other applications.
comment: Project Page: https://yhyang-myron.github.io/DreamComposer/
♻ ☆ Transport meets Variational Inference: Controlled Monte Carlo Diffusions ICML
Connecting optimal transport and variational inference, we present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of the \emph{Controlled Monte Carlo Diffusion} sampler (CMCD) for Bayesian computation, a score-based annealing technique that crucially adapts both forward and backward dynamics in a diffusion model. On the way, we clarify the relationship between the EM-algorithm and iterative proportional fitting (IPF) for Schr{\"o}dinger bridges, deriving as well a regularised objective that bypasses the iterative bottleneck of standard IPF-updates. Finally, we show that CMCD has a strong foundation in the Jarzinsky and Crooks identities from statistical physics, and that it convincingly outperforms competing approaches across a wide array of experiments.
comment: Workshop on New Frontiers in Learning, Control, and Dynamical Systems at the International Conference on Machine Learning (ICML), Honolulu, Hawaii, USA, 2023
♻ ☆ P2ANet: A Dataset and Benchmark for Dense Action Detection from Table Tennis Match Broadcasting Videos
While deep learning has been widely used for video analytics, such as video classification and action detection, dense action detection with fast-moving subjects from sports videos is still challenging. In this work, we release yet another sports video benchmark \TheName{} for \emph{\underline{P}}ing \emph{\underline{P}}ong-\emph{\underline{A}}ction detection, which consists of 2,721 video clips collected from the broadcasting videos of professional table tennis matches in World Table Tennis Championships and Olympiads. We work with a crew of table tennis professionals and referees on a specially designed annotation toolbox to obtain fine-grained action labels (in 14 classes) for every ping-pong action that appeared in the dataset, and formulate two sets of action detection problems -- \emph{action localization} and \emph{action recognition}. We evaluate a number of commonly-seen action recognition (e.g., TSM, TSN, Video SwinTransformer, and Slowfast) and action localization models (e.g., BSN, BSN++, BMN, TCANet), using \TheName{} for both problems, under various settings. These models can only achieve 48\% area under the AR-AN curve for localization and 82\% top-one accuracy for recognition since the ping-pong actions are dense with fast-moving subjects but broadcasting videos are with only 25 FPS. The results confirm that \TheName{} is still a challenging task and can be used as a special benchmark for dense action detection from videos.
♻ ☆ An Implicit GNN Solver for Poisson-like problems
This paper presents $\Psi$-GNN, a novel Graph Neural Network (GNN) approach for solving the ubiquitous Poisson PDE problems with mixed boundary conditions. By leveraging the Implicit Layer Theory, $\Psi$-GNN models an "infinitely" deep network, thus avoiding the empirical tuning of the number of required Message Passing layers to attain the solution. Its original architecture explicitly takes into account the boundary conditions, a critical prerequisite for physical applications, and is able to adapt to any initially provided solution. $\Psi$-GNN is trained using a "physics-informed" loss, and the training process is stable by design, and insensitive to its initialization. Furthermore, the consistency of the approach is theoretically proven, and its flexibility and generalization efficiency are experimentally demonstrated: the same learned model can accurately handle unstructured meshes of various sizes, as well as different boundary conditions. To the best of our knowledge, $\Psi$-GNN is the first physics-informed GNN-based method that can handle various unstructured domains, boundary conditions and initial solutions while also providing convergence guarantees.
♻ ☆ Ensemble learning for Physics Informed Neural Networks: a Gradient Boosting approach
While the popularity of physics-informed neural networks (PINNs) is steadily rising, to this date, PINNs have not been successful in simulating multi-scale and singular perturbation problems. In this work, we present a new training paradigm referred to as "gradient boosting" (GB), which significantly enhances the performance of physics informed neural networks (PINNs). Rather than learning the solution of a given PDE using a single neural network directly, our algorithm employs a sequence of neural networks to achieve a superior outcome. This approach allows us to solve problems presenting great challenges for traditional PINNs. Our numerical experiments demonstrate the effectiveness of our algorithm through various benchmarks, including comparisons with finite element methods and PINNs. Furthermore, this work also unlocks the door to employing ensemble learning techniques in PINNs, providing opportunities for further improvement in solving PDEs.
♻ ☆ Bayesian data-driven discovery of partial differential equations with variable coefficients
The discovery of Partial Differential Equations (PDEs) is an essential task for applied science and engineering. However, data-driven discovery of PDEs is generally challenging, primarily stemming from the sensitivity of the discovered equation to noise and the complexities of model selection. In this work, we propose an advanced Bayesian sparse learning algorithm for PDE discovery with variable coefficients, predominantly when the coefficients are spatially or temporally dependent. Specifically, we apply threshold Bayesian group Lasso regression with a spike-and-slab prior (tBGL-SS) and leverage a Gibbs sampler for Bayesian posterior estimation of PDE coefficients. This approach not only enhances the robustness of point estimation with valid uncertainty quantification but also relaxes the computational burden from Bayesian inference through the integration of coefficient thresholds as an approximate MCMC method. Moreover, from the quantified uncertainties, we propose a Bayesian total error bar criteria for model selection, which outperforms classic metrics including the root mean square and the Akaike information criterion. The capability of this method is illustrated by the discovery of several classical benchmark PDEs with spatially or temporally varying coefficients from solution data obtained from the reference simulations. In the experiments, we show that the tBGL-SS method is more robust than the baseline methods under noisy environments and provides better model selection criteria along the regularization path.
♻ ☆ Graph Signal Diffusion Model for Collaborative Filtering SIGIR 2024
Collaborative filtering is a critical technique in recommender systems. Among various methods, an increasingly popular paradigm is to reconstruct user-item interactions based on the historical observations. This can be viewed as a conditional generative task, where recently developed diffusion model demonstrates great potential. However, existing studies on diffusion models lack effective solutions for modeling implicit feedback data. Particularly, the isotropic nature of the standard diffusion process fails to account for the heterogeneous dependencies among items, leading to a misalignment with the graphical structure of the interaction space. Meanwhile, random noise destroying personalized information in interaction vectors, causing difficulty in reverse reconstruction. In this paper, we make novel adaptions of diffusion model and propose Graph Signal Diffusion Model for Collaborative Filtering (named GiffCF). To better represent the high-dimensional and sparse distribution of implicit feedback, we define a generalized form of denoising diffusion using heat equation on the item-item similarity graph. Our forward process smooths interaction signals with an advanced family of graph filters. Hence, instead of losing information, it involves item-item similarities as beneficial prior knowledge for recommendation. To reconstruct high-quality interactions, our reverse process iteratively refines and sharpens preference signals in a deterministic manner, where the update direction is conditioned on the user history and computed from a carefully designed two-stage denoiser. Finally, through extensive experiments, we show that GiffCF effectively leverages the advantages of both diffusion model and graph signal processing, and achieves state-of-the-art performance on three benchmark datasets.
comment: 11 pages, 8 figures, Accepted by SIGIR 2024
♻ ☆ Riemannian Laplace Approximation with the Fisher Metric AISTATS 2024
Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace Approximation transforms the Gaussian approximation according to a chosen Riemannian geometry providing a richer approximation family, while still retaining computational efficiency. However, as shown here, its properties depend heavily on the chosen metric, indeed the metric adopted in previous work results in approximations that are overly narrow as well as being biased even at the limit of infinite data. We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact at the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments.
comment: AISTATS 2024, with additional fixes
♻ ☆ DISL: Fueling Research with A Large Dataset of Solidity Smart Contracts
The DISL dataset features a collection of $514,506$ unique Solidity files that have been deployed to Ethereum mainnet. It caters to the need for a large and diverse dataset of real-world smart contracts. DISL serves as a resource for developing machine learning systems and for benchmarking software engineering tools designed for smart contracts. By aggregating every verified smart contract from Etherscan up to January 15, 2024, DISL surpasses existing datasets in size and recency.
♻ ☆ Masked Autoencoders Are Robust Neural Architecture Search Learners
Neural Architecture Search (NAS) currently relies heavily on labeled data, which is both expensive and time-consuming to acquire. In this paper, we propose a novel NAS framework based on Masked Autoencoders (MAE) that eliminates the need for labeled data during the search process. By replacing the supervised learning objective with an image reconstruction task, our approach enables the robust discovery of network architectures without compromising performance and generalization ability. Additionally, we address the problem of performance collapse encountered in the widely-used Differentiable Architecture Search (DARTS) method in the unsupervised paradigm by introducing a multi-scale decoder. Through extensive experiments conducted on various search spaces and datasets, we demonstrate the effectiveness and robustness of the proposed method, providing empirical evidence of its superiority over baseline approaches.
♻ ☆ Domain-Aware Fine-Tuning: Enhancing Neural Network Adaptability
Fine-tuning pre-trained neural network models has become a widely adopted approach across various domains. However, it can lead to the distortion of pre-trained feature extractors that already possess strong generalization capabilities. Mitigating feature distortion during adaptation to new target domains is crucial. Recent studies have shown promising results in handling feature distortion by aligning the head layer on in-distribution datasets before performing fine-tuning. Nonetheless, a significant limitation arises from the treatment of batch normalization layers during fine-tuning, leading to suboptimal performance. In this paper, we propose Domain-Aware Fine-Tuning (DAFT), a novel approach that incorporates batch normalization conversion and the integration of linear probing and fine-tuning. Our batch normalization conversion method effectively mitigates feature distortion by reducing modifications to the neural network during fine-tuning. Additionally, we introduce the integration of linear probing and fine-tuning to optimize the head layer with gradual adaptation of the feature extractor. By leveraging batch normalization layers and integrating linear probing and fine-tuning, our DAFT significantly mitigates feature distortion and achieves improved model performance on both in-distribution and out-of-distribution datasets. Extensive experiments demonstrate that our method outperforms other baseline methods, demonstrating its effectiveness in not only improving performance but also mitigating feature distortion.
♻ ☆ Motion Planning Diffusion: Learning and Planning of Robot Motions with Diffusion Models
Learning priors on trajectory distributions can help accelerate robot motion planning optimization. Given previously successful plans, learning trajectory generative models as priors for a new planning problem is highly desirable. Prior works propose several ways on utilizing this prior to bootstrapping the motion planning problem. Either sampling the prior for initializations or using the prior distribution in a maximum-a-posterior formulation for trajectory optimization. In this work, we propose learning diffusion models as priors. We then can sample directly from the posterior trajectory distribution conditioned on task goals, by leveraging the inverse denoising process of diffusion models. Furthermore, diffusion has been recently shown to effectively encode data multimodality in high-dimensional settings, which is particularly well-suited for large trajectory dataset. To demonstrate our method efficacy, we compare our proposed method - Motion Planning Diffusion - against several baselines in simulated planar robot and 7-dof robot arm manipulator environments. To assess the generalization capabilities of our method, we test it in environments with previously unseen obstacles. Our experiments show that diffusion models are strong priors to encode high-dimensional trajectory distributions of robot motions.
♻ ☆ A Lightweight and Gradient-Stable Neural Layer
To enhance resource efficiency and model deployability of neural networks, we propose a neural-layer architecture based on Householder weighting and absolute-value activating, called Householder-absolute neural layer or simply Han-layer. Compared to a fully connected layer with $d$-neurons and $d$ outputs, a Han-layer reduces the number of parameters and the corresponding computational complexity from $O(d^2)$ to $O(d)$. {The Han-layer structure guarantees that the Jacobian of the layer function is always orthogonal, thus ensuring gradient stability (i.e., free of gradient vanishing or exploding issues) for any Han-layer sub-networks.} Extensive numerical experiments show that one can strategically use Han-layers to replace fully connected (FC) layers, reducing the number of model parameters while maintaining or even improving the generalization performance. We will also showcase the capabilities of the Han-layer architecture on a few small stylized models, and discuss its current limitations.
♻ ☆ Brain Networks and Intelligence: A Graph Neural Network Based Approach to Resting State fMRI Data
Resting-state functional magnetic resonance imaging (rsfMRI) is a powerful tool for investigating the relationship between brain function and cognitive processes as it allows for the functional organization of the brain to be captured without relying on a specific task or stimuli. In this paper, we present a novel modeling architecture called BrainRGIN for predicting intelligence (fluid, crystallized, and total intelligence) using graph neural networks on rsfMRI derived static functional network connectivity matrices. Extending from the existing graph convolution networks, our approach incorporates a clustering-based embedding and graph isomorphism network in the graph convolutional layer to reflect the nature of the brain sub-network organization and efficient network expression, in combination with TopK pooling and attention-based readout functions. We evaluated our proposed architecture on a large dataset, specifically the Adolescent Brain Cognitive Development Dataset, and demonstrated its effectiveness in predicting individual differences in intelligence. Our model achieved lower mean squared errors and higher correlation scores than existing relevant graph architectures and other traditional machine learning models for all of the intelligence prediction tasks. The middle frontal gyrus exhibited a significant contribution to both fluid and crystallized intelligence, suggesting their pivotal role in these cognitive processes. Total composite scores identified a diverse set of brain regions to be relevant which underscores the complex nature of total intelligence.
♻ ☆ All Rivers Run to the Sea: Private Learning with Asymmetric Flows CVPR 2024
Data privacy is of great concern in cloud machine-learning service platforms, when sensitive data are exposed to service providers. While private computing environments (e.g., secure enclaves), and cryptographic approaches (e.g., homomorphic encryption) provide strong privacy protection, their computing performance still falls short compared to cloud GPUs. To achieve privacy protection with high computing performance, we propose Delta, a new private training and inference framework, with comparable model performance as non-private centralized training. Delta features two asymmetric data flows: the main information-sensitive flow and the residual flow. The main part flows into a small model while the residuals are offloaded to a large model. Specifically, Delta embeds the information-sensitive representations into a low-dimensional space while pushing the information-insensitive part into high-dimension residuals. To ensure privacy protection, the low-dimensional information-sensitive part is secured and fed to a small model in a private environment. On the other hand, the residual part is sent to fast cloud GPUs, and processed by a large model. To further enhance privacy and reduce the communication cost, Delta applies a random binary quantization technique along with a DP-based technique to the residuals before sharing them with the public platform. We theoretically show that Delta guarantees differential privacy in the public environment and greatly reduces the complexity in the private environment. We conduct empirical analyses on CIFAR-10, CIFAR-100 and ImageNet datasets and ResNet-18 and ResNet-34, showing that Delta achieves strong privacy protection, fast training, and inference without significantly compromising the model utility.
comment: Camera-ready for CVPR 2024
♻ ☆ NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks
Protein classification tasks are essential in drug discovery. Real-world protein structures are dynamic, which will determine the properties of proteins. However, the existing machine learning methods, like ProNet (Wang et al., 2022a), only access limited conformational characteristics and protein side-chain features, leading to impractical protein structure and inaccuracy of protein classes in their predictions. In this paper, we propose novel semantic data augmentation methods, Novel Augmentation of New Node Attributes (NaNa), and Molecular Interactions and Geometric Upgrading (MiGu) to incorporate backbone chemical and side-chain biophysical information into protein classification tasks and a co-embedding residual learning framework. Specifically, we leverage molecular biophysical, secondary structure, chemical bonds, and ionic features of proteins to facilitate protein classification tasks. Furthermore, our semantic augmentation methods and the co-embedding residual learning framework can improve the performance of GIN (Xu et al., 2019) on EC and Fold datasets (Bairoch, 2000; Andreeva et al., 2007) by 16.41% and 11.33% respectively. Our code is available at https://github.com/r08b46009/Code_for_MIGU_NANA/tree/main.
♻ ☆ Graph Generation with $K^2$-trees ICLR
Generating graphs from a target distribution is a significant challenge across many domains, including drug discovery and social network analysis. In this work, we introduce a novel graph generation method leveraging $K^2$-tree representation, originally designed for lossless graph compression. The $K^2$-tree representation {encompasses inherent hierarchy while enabling compact graph generation}. In addition, we make contributions by (1) presenting a sequential $K^2$-treerepresentation that incorporates pruning, flattening, and tokenization processes and (2) introducing a Transformer-based architecture designed to generate the sequence by incorporating a specialized tree positional encoding scheme. Finally, we extensively evaluate our algorithm on four general and two molecular graph datasets to confirm its superiority for graph generation.
comment: International Conference on Learning Representations (ICLR) 2024
♻ ☆ A Simple and Scalable Representation for Graph Generation ICLR
Recently, there has been a surge of interest in employing neural networks for graph generation, a fundamental statistical learning problem with critical applications like molecule design and community analysis. However, most approaches encounter significant limitations when generating large-scale graphs. This is due to their requirement to output the full adjacency matrices whose size grows quadratically with the number of nodes. In response to this challenge, we introduce a new, simple, and scalable graph representation named gap encoded edge list (GEEL) that has a small representation size that aligns with the number of edges. In addition, GEEL significantly reduces the vocabulary size by incorporating the gap encoding and bandwidth restriction schemes. GEEL can be autoregressively generated with the incorporation of node positional encoding, and we further extend GEEL to deal with attributed graphs by designing a new grammar. Our findings reveal that the adoption of this compact representation not only enhances scalability but also bolsters performance by simplifying the graph generation process. We conduct a comprehensive evaluation across ten non-attributed and two molecular graph generation tasks, demonstrating the effectiveness of GEEL.
comment: International Conference on Learning Representations (ICLR) 2024
♻ ☆ Identification of Craving Maps among Marijuana Users via the Analysis of Functional Brain Networks with High-Order Attention Graph Neural Networks
The excessive consumption of marijuana can induce substantial psychological and social consequences. In this investigation, we propose an elucidative framework termed high-order graph attention neural networks (HOGANN) for the classification of Marijuana addiction, coupled with an analysis of localized brain network communities exhibiting abnormal activities among chronic marijuana users. HOGANN integrates dynamic intrinsic functional brain networks, estimated from resting-state functional magnetic resonance imaging (rs-fMRI), using long short-term memory (LSTM) to capture temporal network dynamics. We employ a high-order attention module for information fusion and message passing among neighboring nodes, enhancing the network community analysis. Our model is validated across two distinct data cohorts, yielding substantially higher classification accuracy than benchmark algorithms. Furthermore, we discern the most pertinent subnetworks and cognitive regions affected by persistent marijuana consumption, indicating adverse effects on functional brain networks, particularly within the dorsal attention and frontoparietal networks. Intriguingly, our model demonstrates superior performance in cohorts exhibiting prolonged dependence, implying that prolonged marijuana usage induces more pronounced alterations in brain networks. The model proficiently identifies craving brain maps, thereby delineating critical brain regions for analysis.
♻ ☆ Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning CVPR 2024
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. Relying on learning the joint representation of seen compositions, these methods ignore the explicit modeling of the state and object, thus limiting the exploitation of pre-trained knowledge and generalization to unseen compositions. With a particular focus on the universality of the solution, in this work, we propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition. The presented Troika is our implementation that aligns the branch-specific prompt representations with decomposed visual features. To calibrate the bias between semantically similar multi-modal representations, we further devise a Cross-Modal Traction module into Troika that shifts the prompt representation towards the current visual content. We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings. The code will be available at https://github.com/bighuang624/Troika.
comment: CVPR 2024
♻ ☆ Prediction Error Estimation in Random Forests
In this paper, error estimates of classification Random Forests are quantitatively assessed. Based on the initial theoretical framework built by Bates et al. (2023), the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests' estimates of prediction error is closer on average to the true error rate instead of the average prediction error. This is opposite the findings of Bates et al. (2023) which are given for logistic regression. We further show that our result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.
comment: arXiv admin note: text overlap with arXiv:2104.00673 by other authors
♻ ☆ Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. In this work, we first conduct a systematic study of the misalignment between LLM evaluators and human judgement, revealing that existing calibration methods aimed at mitigating biases are insufficient for effectively aligning LLM evaluators. Inspired by the use of preference data in RLHF, we formulate the evaluation as a ranking problem and introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts. PairS achieves state-of-the-art performance on representative evaluation tasks and demonstrates significant improvements over direct scoring. Furthermore, we provide insights into the role of pairwise preference in quantifying the transitivity of LLMs and demonstrate how PairS benefits from calibration.
♻ ☆ Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic COLING 2024
Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their reasoning often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. These models sometimes show hallucinations as their reasoning procedures are unconstrained by logical principles. Aiming at improving the zero-shot chain-of-thought reasoning ability of large language models, we propose LoT (Logical Thoughts), a self-improvement prompting framework that leverages principles rooted in symbolic logic, particularly Reductio ad Absurdum, to systematically verify and rectify the reasoning processes step by step. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic. The implementation code for LoT can be accessed at: https://github.com/xf-zhao/LoT.
comment: Accepted in COLING 2024. Code see https://github.com/xf-zhao/LoT
♻ ☆ Learning Flexible Body Collision Dynamics with Hierarchical Contact Mesh Transformer ICLR 2024
Recently, many mesh-based graph neural network (GNN) models have been proposed for modeling complex high-dimensional physical systems. Remarkable achievements have been made in significantly reducing the solving time compared to traditional numerical solvers. These methods are typically designed to i) reduce the computational cost in solving physical dynamics and/or ii) propose techniques to enhance the solution accuracy in fluid and rigid body dynamics. However, it remains under-explored whether they are effective in addressing the challenges of flexible body dynamics, where instantaneous collisions occur within a very short timeframe. In this paper, we present Hierarchical Contact Mesh Transformer (HCMT), which uses hierarchical mesh structures and can learn long-range dependencies (occurred by collisions) among spatially distant positions of a body -- two close positions in a higher-level mesh correspond to two distant positions in a lower-level mesh. HCMT enables long-range interactions, and the hierarchical mesh structure quickly propagates collision effects to faraway positions. To this end, it consists of a contact mesh Transformer and a hierarchical mesh Transformer (CMT and HMT, respectively). Lastly, we propose a flexible body dynamics dataset, consisting of trajectories that reflect experimental settings frequently used in the display industry for product designs. We also compare the performance of several baselines using well-known benchmark datasets. Our results show that HCMT provides significant performance improvements over existing methods. Our code is available at https://github.com/yuyudeep/hcmt.
comment: Accepted at ICLR 2024
Cryptography and Security
☆ Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications
The success of Large Language Models (LLMs) has led to a parallel rise in the development of Large Multimodal Models (LMMs), such as Gemini-pro, which have begun to transform a variety of applications. These sophisticated multimodal models are designed to interpret and analyze complex data, integrating both textual and visual information on a scale previously unattainable, opening new avenues for a range of applications. This paper investigates the applicability and effectiveness of prompt-engineered Gemini-pro LMMs versus fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct tasks: a visually evident task of detecting simple triggers, such as small squares in images, indicative of potential backdoors, and a non-visually evident task of malware classification through visual representations. Our results highlight a significant divergence in performance, with Gemini-pro falling short in accuracy and reliability when compared to fine-tuned ViT models. The ViT models, on the other hand, demonstrate exceptional accuracy, achieving near-perfect performance on both tasks. This study not only showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications but also emphasizes the unmatched efficacy of fine-tuned ViT models for precise and dependable tasks.
☆ Secure Aggregation is Not Private Against Membership Inference Attacks
Secure aggregation (SecAgg) is a commonly-used privacy-enhancing mechanism in federated learning, affording the server access only to the aggregate of model updates while safeguarding the confidentiality of individual updates. Despite widespread claims regarding SecAgg's privacy-preserving capabilities, a formal analysis of its privacy is lacking, making such presumptions unjustified. In this paper, we delve into the privacy implications of SecAgg by treating it as a local differential privacy (LDP) mechanism for each local update. We design a simple attack wherein an adversarial server seeks to discern which update vector a client submitted, out of two possible ones, in a single training round of federated learning under SecAgg. By conducting privacy auditing, we assess the success probability of this attack and quantify the LDP guarantees provided by SecAgg. Our numerical results unveil that, contrary to prevailing claims, SecAgg offers weak privacy against membership inference attacks even in a single training round. Indeed, it is difficult to hide a local update by adding other independent local updates when the updates are of high dimension. Our findings underscore the imperative for additional privacy-enhancing mechanisms, such as noise injection, in federated learning.
☆ DataCook: Crafting Anti-Adversarial Examples for Healthcare Data Copyright Protection
In the realm of healthcare, the challenges of copyright protection and unauthorized third-party misuse are increasingly significant. Traditional methods for data copyright protection are applied prior to data distribution, implying that models trained on these data become uncontrollable. This paper introduces a novel approach, named DataCook, designed to safeguard the copyright of healthcare data during the deployment phase. DataCook operates by "cooking" the raw data before distribution, enabling the development of models that perform normally on this processed data. However, during the deployment phase, the original test data must be also "cooked" through DataCook to ensure normal model performance. This process grants copyright holders control over authorization during the deployment phase. The mechanism behind DataCook is by crafting anti-adversarial examples (AntiAdv), which are designed to enhance model confidence, as opposed to standard adversarial examples (Adv) that aim to confuse models. Similar to Adv, AntiAdv introduces imperceptible perturbations, ensuring that the data processed by DataCook remains easily understandable. We conducted extensive experiments on MedMNIST datasets, encompassing both 2D/3D data and the high-resolution variants. The outcomes indicate that DataCook effectively meets its objectives, preventing models trained on AntiAdv from analyzing unauthorized data effectively, without compromising the validity and accuracy of the data in legitimate scenarios. Code and data are available at https://github.com/MedMNIST/DataCook.
☆ Optimization-based Prompt Injection Attack to LLM-as-a-Judge
LLM-as-a-Judge is a novel solution that can assess textual information with large language models (LLMs). Based on existing research studies, LLMs demonstrate remarkable performance in providing a compelling alternative to traditional human assessment. However, the robustness of these systems against prompt injection attacks remains an open question. In this work, we introduce JudgeDeceiver, a novel optimization-based prompt injection attack tailored to LLM-as-a-Judge. Our method formulates a precise optimization objective for attacking the decision-making process of LLM-as-a-Judge and utilizes an optimization algorithm to efficiently automate the generation of adversarial sequences, achieving targeted and effective manipulation of model evaluations. Compared to handcraft prompt injection attacks, our method demonstrates superior efficacy, posing a significant challenge to the current security paradigms of LLM-based judgment systems. Through extensive experiments, we showcase the capability of JudgeDeceiver in altering decision outcomes across various cases, highlighting the vulnerability of LLM-as-a-Judge systems to the optimization-based prompt injection attack.
☆ Depending on yourself when you should: Mentoring LLM with RL agents to become the master in cybersecurity games
Integrating LLM and reinforcement learning (RL) agent effectively to achieve complementary performance is critical in high stake tasks like cybersecurity operations. In this study, we introduce SecurityBot, a LLM agent mentored by pre-trained RL agents, to support cybersecurity operations. In particularly, the LLM agent is supported with a profile module to generated behavior guidelines, a memory module to accumulate local experiences, a reflection module to re-evaluate choices, and an action module to reduce action space. Additionally, it adopts the collaboration mechanism to take suggestions from pre-trained RL agents, including a cursor for dynamic suggestion taken, an aggregator for multiple mentors' suggestions ranking and a caller for proactive suggestion asking. Building on the CybORG experiment framework, our experiences show that SecurityBot demonstrates significant performance improvement compared with LLM or RL standalone, achieving the complementary performance in the cybersecurity games.
comment: 10 pages
☆ How Private is DP-SGD?
We demonstrate a substantial gap between the privacy guarantees of the Adaptive Batch Linear Queries (ABLQ) mechanism under different types of batch sampling: (i) Shuffling, and (ii) Poisson subsampling; the typical analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) follows by interpreting it as a post-processing of ABLQ. While shuffling based DP-SGD is more commonly used in practical implementations, it is neither analytically nor numerically amenable to easy privacy analysis. On the other hand, Poisson subsampling based DP-SGD is challenging to scalably implement, but has a well-understood privacy analysis, with multiple open-source numerically tight privacy accountants available. This has led to a common practice of using shuffling based DP-SGD in practice, but using the privacy analysis for the corresponding Poisson subsampling version. Our result shows that there can be a substantial gap between the privacy analysis when using the two types of batch sampling, and thus advises caution in reporting privacy parameters for DP-SGD.
☆ Healthcare Data Governance, Privacy, and Security - A Conceptual Framework
The abundance of data has transformed the world in every aspect. It has become the core element in decision making, problem solving, and innovation in almost all areas of life, including business, science, healthcare, education, and many others. Despite all these advances, privacy and security remain critical concerns of the healthcare industry. It is important to note that healthcare data can also be a liability if it is not managed correctly. This data mismanagement can have severe consequences for patients and healthcare organisations, including patient safety, legal liability, damage to reputation, financial loss, and operational inefficiency. Healthcare organisations must comply with a range of regulations to protect patient data. We perform a classification of data governance elements or components in a manner that thoroughly assesses the healthcare data chain from a privacy and security standpoint. After deeply analysing the existing literature, we propose a conceptual privacy and security driven healthcare data governance framework.
☆ Ransomware: Analysis and Evaluation of Live Forensic Techniques and the Impact on Linux based IoT Systems
Ransomware has been predominantly a threat to Windows systems. But, Linux systems became interesting for cybercriminals and this trend is expected to continue. This endangers IoT ecosystems, whereas many IoT systems are based on Linux (e.g. cloud infrastructure and gateways). This paper researches how currently employed forensic techniques can be applied to Linux ransomware and evaluates the maturity as well as the impact on the system. While Windows-based ransomware predominantly uses RSA and AES for key management, a variety of approaches was identified for Linux. Cybercriminals appear to be deliberately moving away from RSA and AES to make Live forensic investigations more difficult. Linux ransomware is developed for a predefined goal and does not exploit the full potential of damage. It appears in an early stage and is expected to reach a similar potential to Windows-based malware. The results generated provided an excellent basic understanding to discuss and assess implications on the IoT industry at an early stage of development.
☆ Provably Secure Disambiguating Neural Linguistic Steganography
Recent research in provably secure neural linguistic steganography has overlooked a crucial aspect: the sender must detokenize stegotexts to avoid raising suspicion from the eavesdropper. The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures in all neural language steganography implementations based on these models. Current solutions to this issue involve altering the probability distribution of candidate words, rendering them incompatible with provably secure steganography. We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem. We group all tokens with prefix relationships in the candidate pool before the steganographic embedding algorithm runs to eliminate uncertainty among ambiguous tokens. To enable the receiver to synchronize the sampling process of the sender, a shared cryptographically-secure pseudorandom number generator (CSPRNG) is deployed to select a token from the ambiguity pool. SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods. We provide theoretical proofs and experimentally demonstrate the applicability of our solution to various languages and models, showing its potential to significantly improve the reliability and security of neural linguistic steganography systems.
☆ FaultGuard: A Generative Approach to Resilient Fault Prediction in Smart Electrical Grids
Predicting and classifying faults in electricity networks is crucial for uninterrupted provision and keeping maintenance costs at a minimum. Thanks to the advancements in the field provided by the smart grid, several data-driven approaches have been proposed in the literature to tackle fault prediction tasks. Implementing these systems brought several improvements, such as optimal energy consumption and quick restoration. Thus, they have become an essential component of the smart grid. However, the robustness and security of these systems against adversarial attacks have not yet been extensively investigated. These attacks can impair the whole grid and cause additional damage to the infrastructure, deceiving fault detection systems and disrupting restoration. In this paper, we present FaultGuard, the first framework for fault type and zone classification resilient to adversarial attacks. To ensure the security of our system, we employ an Anomaly Detection System (ADS) leveraging a novel Generative Adversarial Network training layer to identify attacks. Furthermore, we propose a low-complexity fault prediction model and an online adversarial training technique to enhance robustness. We comprehensively evaluate the framework's performance against various adversarial attacks using the IEEE13-AdvAttack dataset, which constitutes the state-of-the-art for resilient fault prediction benchmarking. Our model outclasses the state-of-the-art even without considering adversaries, with an accuracy of up to 0.958. Furthermore, our ADS shows attack detection capabilities with an accuracy of up to 1.000. Finally, we demonstrate how our novel training layers drastically increase performances across the whole framework, with a mean increase of 154% in ADS accuracy and 118% in model accuracy.
☆ Expectations Versus Reality: Evaluating Intrusion Detection Systems in Practice
Our paper provides empirical comparisons between recent IDSs to provide an objective comparison between them to help users choose the most appropriate solution based on their requirements. Our results show that no one solution is the best, but is dependent on external variables such as the types of attacks, complexity, and network environment in the dataset. For example, BoT_IoT and Stratosphere IoT datasets both capture IoT-related attacks, but the deep neural network performed the best when tested using the BoT_IoT dataset while HELAD performed the best when tested using the Stratosphere IoT dataset. So although we found that a deep neural network solution had the highest average F1 scores on tested datasets, it is not always the best-performing one. We further discuss difficulties in using IDS from literature and project repositories, which complicated drawing definitive conclusions regarding IDS selection.
comment: 10 pages
☆ The Privacy Policy Permission Model: A Unified View of Privacy Policies
Organizations use privacy policies to communicate their data collection practices to their clients. A privacy policy is a set of statements that specifies how an organization gathers, uses, discloses, and maintains a client's data. However, most privacy policies lack a clear, complete explanation of how data providers' information is used. We propose a modeling methodology, called the Privacy Policy Permission Model (PPPM), that provides a uniform, easy-to-understand representation of privacy policies, which can accurately and clearly show how data is used within an organization's practice. Using this methodology, a privacy policy is captured as a diagram. The diagram is capable of highlighting inconsistencies and inaccuracies in the privacy policy. The methodology supports privacy officers in properly and clearly articulating an organization's privacy policy.
comment: 23 pages + 2 pages references + 11 Pages Appendix, 19 figures,Published in teh Trasactions on Data Privacy in April 2021
☆ Characterizing Dependency Update Practice of NPM, PyPI and Cargo Packages
Keeping dependencies up-to-date prevents software supply chain attacks through outdated and vulnerable dependencies. Developers may use packages' dependency update practice as one of the selection criteria for choosing a package as a dependency. However, the lack of metrics characterizing packages' dependency update practice makes this assessment difficult. To measure the up-to-date characteristics of packages, we focus on the dependency management aspect and propose two update metrics: Time-Out-Of-Date (TOOD) and Post-Fix-Exposure-Time (PFET), to measure the updatedness of dependencies and updatedness of vulnerable dependencies, respectively. We design an algorithm to stabilize the dependency relationships in different time intervals and compute the proposed metrics for each package. Using our proposed metrics, we conduct a large-scale empirical study of update metrics with 2.9M packages, 66.8M package versions, and 26.8M unique package-dependency relations in NPM, PyPI, and Cargo, ranging from the year 2004 to 2023. We analyze the characteristics of the proposed metrics for capturing packages' dependency update practice in the three ecosystems. Given that the TOOD metric generates a greater volume of data than the PFET metric, we further explore the numerical relationship between these metrics to assess their potential as substitutes for vulnerability counts metrics. We find that PyPI packages update dependencies faster than NPM and Cargo. Conversely, Cargo packages update their vulnerable dependencies faster than NPM and PyPI. We also find that the general purpose update metric, TOOD, can be a proxy for the security-focused update metric, PFET.
comment: currently under review
☆ The Solution of the Zodiac Killer's 340-Character Cipher
The case of the Zodiac Killer is one of the most widely known unsolved serial killer cases in history. The unidentified killer murdered five known victims and terrorized the state of California. He also communicated extensively with the press and law enforcement. Besides his murders, Zodiac was known for his use of ciphers. The first Zodiac cipher was solved within a week of its publication, while the second cipher was solved by the authors after 51 years, when it was discovered to be a transposition and homophonic substitution cipher with unusual qualities. In this paper, we detail the historical significance of this cipher and the numerous efforts which culminated in its solution.
☆ Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models USENIX Security 2024
Recent advancements in generative AI have enabled ubiquitous access to large language models (LLMs). Empowered by their exceptional capabilities to understand and generate human-like text, these models are being increasingly integrated into our society. At the same time, there are also concerns on the potential misuse of this powerful technology, prompting defensive measures from service providers. To overcome such protection, jailbreaking prompts have recently emerged as one of the most effective mechanisms to circumvent security restrictions and elicit harmful content originally designed to be prohibited. Due to the rapid development of LLMs and their ease of access via natural languages, the frontline of jailbreak prompts is largely seen in online forums and among hobbyists. To gain a better understanding of the threat landscape of semantically meaningful jailbreak prompts, we systemized existing prompts and measured their jailbreak effectiveness empirically. Further, we conducted a user study involving 92 participants with diverse backgrounds to unveil the process of manually creating jailbreak prompts. We observed that users often succeeded in jailbreak prompts generation regardless of their expertise in LLMs. Building on the insights from the user study, we also developed a system using AI as the assistant to automate the process of jailbreak prompt generation.
comment: Accepted by USENIX Security 2024
☆ Two Birds with One Stone: Differential Privacy by Low-power SRAM Memory
The software-based implementation of differential privacy mechanisms has been shown to be neither friendly for lightweight devices nor secure against side-channel attacks. In this work, we aim to develop a hardware-based technique to achieve differential privacy by design. In contrary to the conventional software-based noise generation and injection process, our design realizes local differential privacy (LDP) by harnessing the inherent hardware noise into controlled LDP noise when data is stored in the memory. Specifically, the noise is tamed through a novel memory design and power downscaling technique, which leads to double-faceted gains in privacy and power efficiency. A well-round study that consists of theoretical design and analysis and chip implementation and experiments is presented. The results confirm that the developed technique is differentially private, saves 88.58% system power, speeds up software-based DP mechanisms by more than 10^6 times, while only incurring 2.46% chip overhead and 7.81% estimation errors in data recovery.
comment: 15 pages, with 2 pages of Appendix
☆ Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving CVPR 2024
Deep learning-based monocular depth estimation (MDE), extensively applied in autonomous driving, is known to be vulnerable to adversarial attacks. Previous physical attacks against MDE models rely on 2D adversarial patches, so they only affect a small, localized region in the MDE map but fail under various viewpoints. To address these limitations, we propose 3D Depth Fool (3D$^2$Fool), the first 3D texture-based adversarial attack against MDE models. 3D$^2$Fool is specifically optimized to generate 3D adversarial textures agnostic to model types of vehicles and to have improved robustness in bad weather conditions, such as rain and fog. Experimental results validate the superior performance of our 3D$^2$Fool across various scenarios, including vehicles, MDE models, weather conditions, and viewpoints. Real-world experiments with printed 3D textures on physical vehicle models further demonstrate that our 3D$^2$Fool can cause an MDE error of over 10 meters.
comment: Accepted by CVPR 2024
☆ Hawk: Accurate and Fast Privacy-Preserving Machine Learning Using Secure Lookup Table Computation
Training machine learning models on data from multiple entities without direct data sharing can unlock applications otherwise hindered by business, legal, or ethical constraints. In this work, we design and implement new privacy-preserving machine learning protocols for logistic regression and neural network models. We adopt a two-server model where data owners secret-share their data between two servers that train and evaluate the model on the joint data. A significant source of inefficiency and inaccuracy in existing methods arises from using Yao's garbled circuits to compute non-linear activation functions. We propose new methods for computing non-linear functions based on secret-shared lookup tables, offering both computational efficiency and improved accuracy. Beyond introducing leakage-free techniques, we initiate the exploration of relaxed security measures for privacy-preserving machine learning. Instead of claiming that the servers gain no knowledge during the computation, we contend that while some information is revealed about access patterns to lookup tables, it maintains epsilon-dX-privacy. Leveraging this relaxation significantly reduces the computational resources needed for training. We present new cryptographic protocols tailored to this relaxed security paradigm and define and analyze the leakage. Our evaluations show that our logistic regression protocol is up to 9x faster, and the neural network training is up to 688x faster than SecureML. Notably, our neural network achieves an accuracy of 96.6% on MNIST in 15 epochs, outperforming prior benchmarks that capped at 93.4% using the same architecture.
comment: Accepted at Privacy Enhancing Technologies Symposium (PETS) 2024
♻ ☆ Differentially private multivariate medians
Statistical tools which satisfy rigorous privacy guarantees are necessary for modern data analysis. It is well-known that robustness against contamination is linked to differential privacy. Despite this fact, using multivariate medians for differentially private and robust multivariate location estimation has not been systematically studied. We develop novel finite-sample performance guarantees for differentially private multivariate depth-based medians, which are essentially sharp. Our results cover commonly used depth functions, such as the halfspace (or Tukey) depth, spatial depth, and the integrated dual depth. We show that under Cauchy marginals, the cost of heavy-tailed location estimation outweighs the cost of privacy. We demonstrate our results numerically using a Gaussian contamination model in dimensions up to d = 100, and compare them to a state-of-the-art private mean estimation algorithm. As a by-product of our investigation, we prove concentration inequalities for the output of the exponential mechanism about the maximizer of the population objective function. This bound applies to objective functions that satisfy a mild regularity condition.
comment: 42 pages, 3 figures, 2 tables
♻ ☆ Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
DeepFakes, which refer to AI-generated media content, have become an increasing concern due to their use as a means for disinformation. Detecting DeepFakes is currently solved with programmed machine learning algorithms. In this work, we investigate the capabilities of multimodal large language models (LLMs) in DeepFake detection. We conducted qualitative and quantitative experiments to demonstrate multimodal LLMs and show that they can expose AI-generated images through careful experimental design and prompt engineering. This is interesting, considering that LLMs are not inherently tailored for media forensic tasks, and the process does not require programming. We discuss the limitations of multimodal LLMs for these tasks and suggest possible improvements.
♻ ☆ Analyzing the Quality Attributes of AI Vision Models in Open Repositories Under Adversarial Attacks
As AI models rapidly evolve, they are frequently released to open repositories, such as HuggingFace. It is essential to perform quality assurance validation on these models before integrating them into the production development lifecycle. In addition to evaluating efficiency in terms of balanced accuracy and computing costs, adversarial attacks are potential threats to the robustness and explainability of AI models. Meanwhile, XAI applies algorithms that approximate inputs to outputs post-hoc to identify the contributing features. Adversarial perturbations may also degrade the utility of XAI explanations that require further investigation. In this paper, we present an integrated process designed for downstream evaluation tasks, including validating AI model accuracy, evaluating robustness with benchmark perturbations, comparing explanation utility, and assessing overhead. We demonstrate an evaluation scenario involving six computer vision models, which include CNN-based, Transformer-based, and hybrid architectures, three types of perturbations, and five XAI methods, resulting in ninety unique combinations. The process reveals the explanation utility among the XAI methods in terms of the identified key areas responding to the adversarial perturbation. The process produces aggregated results that illustrate multiple attributes of each AI model.
♻ ☆ Brokenwire : Wireless Disruption of CCS Electric Vehicle Charging
We present a novel attack against the Combined Charging System, one of the most widely used DC rapid charging technologies for electric vehicles (EVs). Our attack, Brokenwire, interrupts necessary control communication between the vehicle and charger, causing charging sessions to abort. The attack requires only temporary physical proximity and can be conducted wirelessly from a distance, allowing individual vehicles or entire fleets to be disrupted stealthily and simultaneously. In addition, it can be mounted with off-the-shelf radio hardware and minimal technical knowledge. By exploiting CSMA/CA behavior, only a very weak signal needs to be induced into the victim to disrupt communication - exceeding the effectiveness of broadband noise jamming by three orders of magnitude. The exploited behavior is a required part of the HomePlug Green PHY, DIN 70121 & ISO 15118 standards and all known implementations exhibit it. We first study the attack in a controlled testbed and then demonstrate it against eight vehicles and 20 chargers in real deployments. We find the attack to be successful in the real world, at ranges up to 47 m, for a power budget of less than 1 W. We further show that the attack can work between the floors of a building (e.g., multi-story parking), through perimeter fences, and from `drive-by' attacks. We present a heuristic model to estimate the number of vehicles that can be attacked simultaneously for a given output power. Brokenwire has immediate implications for a substantial proportion of the around 12 million battery EVs on the roads worldwide - and profound effects on the new wave of electrification for vehicle fleets, both for private enterprise and crucial public services, as well as electric buses, trucks and small ships. As such, we conducted a disclosure to the industry and discussed a range of mitigation techniques that could be deployed to limit the impact.
♻ ☆ Investigating Feature and Model Importance in Android Malware Detection: An Implemented Survey and Experimental Comparison of ML-Based Methods
The popularity of Android means it is a common target for malware. Over the years, various studies have found that machine learning models can effectively discriminate malware from benign applications. However, as the operating system evolves, so does malware, bringing into question the findings of these previous studies, many of which report very high accuracies using small, outdated, and often imbalanced datasets. In this paper, we reimplement 18 representative past works and reevaluate them using a balanced, relevant, and up-to-date dataset comprising 124,000 applications. We also carry out new experiments designed to fill holes in existing knowledge, and use our findings to identify the most effective features and models to use for Android malware detection within a contemporary environment. We show that high detection accuracies (up to 96.8%) can be achieved using features extracted through static analysis alone, yielding a modest benefit (1%) from using far more expensive dynamic analysis. API calls and opcodes are the most productive static and TCP network traffic provide the most predictive dynamic features. Random forests are generally the most effective model, outperforming more complex deep learning approaches. Whilst directly combining static and dynamic features is generally ineffective, ensembling models separately leads to performances comparable to the best models but using less brittle features.
♻ ☆ Towards more Practical Threat Models in Artificial Intelligence Security
Recent works have identified a gap between research and practice in artificial intelligence security: threats studied in academia do not always reflect the practical use and security risks of AI. For example, while models are often studied in isolation, they form part of larger ML pipelines in practice. Recent works also brought forward that adversarial manipulations introduced by academic attacks are impractical. We take a first step towards describing the full extent of this disparity. To this end, we revisit the threat models of the six most studied attacks in AI security research and match them to AI usage in practice via a survey with 271 industrial practitioners. On the one hand, we find that all existing threat models are indeed applicable. On the other hand, there are significant mismatches: research is often too generous with the attacker, assuming access to information not frequently available in real-world settings. Our paper is thus a call for action to study more practical threat models in artificial intelligence security.
comment: 18 pages, 4 figures, 8 tables, accepted to Usenix Security, incorporated external feedback
♻ ☆ Byzantine-resilient Federated Learning With Adaptivity to Data Heterogeneity
This paper deals with federated learning (FL) in the presence of malicious Byzantine attacks and data heterogeneity. A novel Robust Average Gradient Algorithm (RAGA) is proposed, which leverages the geometric median for aggregation and can freely select the round number for local updating. Different from most existing resilient approaches, which perform convergence analysis based on strongly-convex loss function or homogeneously distributed dataset, we conduct convergence analysis for not only strongly-convex but also non-convex loss function over heterogeneous dataset. According to our theoretical analysis, as long as the fraction of dataset from malicious users is less than half, RAGA can achieve convergence at rate $\mathcal{O}({1}/{T^{2/3- \delta}})$ where $T$ is the iteration number and $\delta \in (0, 2/3)$ for non-convex loss function, and at linear rate for strongly-convex loss function. Moreover, stationary point or global optimal solution is proved to obtainable as data heterogeneity vanishes. Experimental results corroborate the robustness of RAGA to Byzantine attacks and verifies the advantage of RAGA over baselines on convergence performance under various intensity of Byzantine attacks, for heterogeneous dataset.
♻ ☆ Clean-image Backdoor Attacks
To gather a significant quantity of annotated training data for high-performance image classification models, numerous companies opt to enlist third-party providers to label their unlabeled data. This practice is widely regarded as secure, even in cases where some annotated errors occur, as the impact of these minor inaccuracies on the final performance of the models is negligible and existing backdoor attacks require attacker's ability to poison the training images. Nevertheless, in this paper, we propose clean-image backdoor attacks which uncover that backdoors can still be injected via a fraction of incorrect labels without modifying the training images. Specifically, in our attacks, the attacker first seeks a trigger feature to divide the training images into two parts: those with the feature and those without it. Subsequently, the attacker falsifies the labels of the former part to a backdoor class. The backdoor will be finally implanted into the target model after it is trained on the poisoned data. During the inference phase, the attacker can activate the backdoor in two ways: slightly modifying the input image to obtain the trigger feature, or taking an image that naturally has the trigger feature as input. We conduct extensive experiments to demonstrate the effectiveness and practicality of our attacks. According to the experimental results, we conclude that our attacks seriously jeopardize the fairness and robustness of image classification models, and it is necessary to be vigilant about the incorrect labels in outsourced labeling.
♻ ☆ All Rivers Run to the Sea: Private Learning with Asymmetric Flows CVPR 2024
Data privacy is of great concern in cloud machine-learning service platforms, when sensitive data are exposed to service providers. While private computing environments (e.g., secure enclaves), and cryptographic approaches (e.g., homomorphic encryption) provide strong privacy protection, their computing performance still falls short compared to cloud GPUs. To achieve privacy protection with high computing performance, we propose Delta, a new private training and inference framework, with comparable model performance as non-private centralized training. Delta features two asymmetric data flows: the main information-sensitive flow and the residual flow. The main part flows into a small model while the residuals are offloaded to a large model. Specifically, Delta embeds the information-sensitive representations into a low-dimensional space while pushing the information-insensitive part into high-dimension residuals. To ensure privacy protection, the low-dimensional information-sensitive part is secured and fed to a small model in a private environment. On the other hand, the residual part is sent to fast cloud GPUs, and processed by a large model. To further enhance privacy and reduce the communication cost, Delta applies a random binary quantization technique along with a DP-based technique to the residuals before sharing them with the public platform. We theoretically show that Delta guarantees differential privacy in the public environment and greatly reduces the complexity in the private environment. We conduct empirical analyses on CIFAR-10, CIFAR-100 and ImageNet datasets and ResNet-18 and ResNet-34, showing that Delta achieves strong privacy protection, fast training, and inference without significantly compromising the model utility.
comment: Camera-ready for CVPR 2024
♻ ☆ Securing Blockchain Systems: A Novel Collaborative Learning Framework to Detect Attacks in Transactions and Smart Contracts
With the escalating prevalence of malicious activities exploiting vulnerabilities in blockchain systems, there is an urgent requirement for robust attack detection mechanisms. To address this challenge, this paper presents a novel collaborative learning framework designed to detect attacks in blockchain transactions and smart contracts by analyzing transaction features. Our framework exhibits the capability to classify various types of blockchain attacks, including intricate attacks at the machine code level (e.g., injecting malicious codes to withdraw coins from users unlawfully), which typically necessitate significant time and security expertise to detect. To achieve that, the proposed framework incorporates a unique tool that transforms transaction features into visual representations, facilitating efficient analysis and classification of low-level machine codes. Furthermore, we propose a customized collaborative learning model to enable real-time detection of diverse attack types at distributed mining nodes. In order to create a comprehensive dataset, we deploy a pilot system based on a private Ethereum network and conduct multiple attack scenarios. To the best of our knowledge, our dataset is the most comprehensive and diverse collection of transactions and smart contracts synthesized in a laboratory for cyberattack detection in blockchain systems. Our framework achieves a detection accuracy of approximately 94\% through extensive simulations and real-time experiments with a throughput of over 2,150 transactions per second. These compelling results validate the efficacy of our framework and showcase its adaptability in addressing real-world cyberattack scenarios.
♻ ☆ Threats, Attacks, and Defenses in Machine Unlearning: A Survey
Machine Unlearning (MU) has gained considerable attention recently for its potential to achieve Safe AI by removing the influence of specific data from trained machine learning models. This process, known as knowledge removal, addresses AI governance concerns of training data such as quality, sensitivity, copyright restrictions, and obsolescence. This capability is also crucial for ensuring compliance with privacy regulations such as the Right To Be Forgotten. Furthermore, effective knowledge removal mitigates the risk of harmful outcomes, safeguarding against biases, misinformation, and unauthorized data exploitation, thereby enhancing the safe and responsible use of AI systems. Efforts have been made to design efficient unlearning approaches, with MU services being examined for integration with existing machine learning as a service, allowing users to submit requests to remove specific data from the training corpus. However, recent research highlights vulnerabilities in machine unlearning systems, such as information leakage and malicious unlearning requests, that can lead to significant security and privacy concerns. Moreover, extensive research indicates that unlearning methods and prevalent attacks fulfill diverse roles within MU systems. For instance, unlearning can act as a mechanism to recover models from backdoor attacks, while backdoor attacks themselves can serve as an evaluation metric for unlearning effectiveness. This underscores the intricate relationship and complex interplay among these mechanisms in maintaining system functionality and safety. This survey aims to fill the gap between the extensive number of studies on threats, attacks, and defenses in machine unlearning and the absence of a comprehensive review that categorizes their taxonomy, methods, and solutions, thus offering valuable insights for future research directions and practical implementations.
♻ ☆ Snail: Secure Single Iteration Localization
Localization is a computer vision task by which the position and orientation of a camera is determined from an image and environmental map. We propose a method for performing localization in a privacy preserving manner supporting two scenarios: first, when the image and map are held by a client who wishes to offload localization to untrusted third parties, and second, when the image and map are held separately by untrusting parties. Privacy preserving localization is necessary when the image and map are confidential, and offloading conserves on-device power and frees resources for other tasks. To accomplish this we integrate existing localization methods and secure multi-party computation (MPC), specifically garbled circuits, yielding proof-based security guarantees in contrast to existing obfuscation-based approaches which recent related work has shown vulnerable. We present two approaches to localization, a baseline data-oblivious adaptation of localization suitable for garbled circuits and our novel Single Iteration Localization. Our technique improves overall performance while maintaining confidentiality of the input image, map, and output pose at the expense of increased communication rounds but reduced computation and communication required per round. Single Iteration Localization is over two orders of magnitude faster than a straightforward application of garbled circuits to localization enabling real-world usage in the first robot to offload localization without revealing input images, environmental map, position, or orientation to offload servers.
Computation and Language
☆ LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
The machine learning community has witnessed impressive advancements since the first appearance of large language models (LLMs), yet their huge memory consumption has become a major roadblock to large-scale training. Parameter Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem, but their performance still fails to match full parameter training in most large-scale fine-tuning settings. Attempting to complement this deficiency, we investigate layerwise properties of LoRA on fine-tuning tasks and observe an uncommon skewness of weight norms across different layers. Utilizing this key observation, a surprisingly simple training strategy is discovered, which outperforms both LoRA and full parameter training in a wide range of settings with memory costs as low as LoRA. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative for LoRA, which applies the idea of importance sampling to different layers in LLMs and randomly freeze most middle layers during optimization. Experimental results show that with similar or less GPU memory consumption, LISA surpasses LoRA or even full parameter tuning in downstream fine-tuning tasks, where LISA consistently outperforms LoRA by over $11\%$-$37\%$ in terms of MT-Bench scores. On large models, specifically LLaMA-2-70B, LISA achieves on-par or better performance than LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.
☆ The Unreasonable Ineffectiveness of the Deeper Layers
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
comment: 12 + 10 pages, 5 + 4 figures
☆ Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications
Natural Language Processing (NLP) models optimized for predictive performance often make high confidence errors and suffer from vulnerability to adversarial and out-of-distribution data. Existing work has mainly focused on mitigation of such errors using either humans or an automated approach. In this study, we explore the usage of large language models (LLMs) for data augmentation as a potential solution to the issue of NLP models making wrong predictions with high confidence during classification tasks. We compare the effectiveness of synthetic data generated by LLMs with that of human data obtained via the same procedure. For mitigation, humans or LLMs provide natural language characterizations of high confidence misclassifications to generate synthetic data, which are then used to extend the training set. We conduct an extensive evaluation of our approach on three classification tasks and demonstrate its effectiveness in reducing the number of high confidence misclassifications present in the model, all while maintaining the same level of accuracy. Moreover, we find that the cost gap between humans and LLMs surpasses an order of magnitude, as LLMs attain human-like performance while being more scalable.
☆ ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages SIGIR 2024
Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, large language models. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchronous document collections like Wikipedia or the Web. Archival document collections, such as historical newspapers, contain valuable information from the past that is still not widely used to train large language models. To further contribute to advancing QA and MRC tasks and to overcome the limitation of previous datasets, we introduce ChroniclingAmericaQA, a large-scale dataset with 485K question-answer pairs created based on the historical newspaper collection Chronicling America. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years. One of the significant challenges for utilizing digitized historical newspaper collections is the low quality of OCR text. Therefore, to enable realistic testing of QA models, our dataset can be used in three different ways: answering questions from raw and noisy content, answering questions from cleaner, corrected version of the content, as well as answering questions from scanned images of newspaper pages. This and the fact that ChroniclingAmericaQA spans the longest time period among available QA datasets make it quite a unique and useful resource.
comment: Accepted at SIGIR 2024
☆ Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs LREC
Lexical-syntactic flexibility, in the form of conversion (or zero-derivation) is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility -- the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models -- two proprietary models (GPT-3.5 and GPT-4), three open-source models (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open source language models are also able to perform it and that the 7B parameter Mistral displays as little difference between its baseline performance on the natural language inference task and the non-prototypical syntactic category task, as the massive GPT-4.
comment: LREC-COLING 2024
☆ Using Domain Knowledge to Guide Dialog Structure Induction via Neural Probabilistic Soft Logic
Dialog Structure Induction (DSI) is the task of inferring the latent dialog structure (i.e., a set of dialog states and their temporal transitions) of a given goal-oriented dialog. It is a critical component for modern dialog system design and discourse analysis. Existing DSI approaches are often purely data-driven, deploy models that infer latent states without access to domain knowledge, underperform when the training corpus is limited/noisy, or have difficulty when test dialogs exhibit distributional shifts from the training domain. This work explores a neural-symbolic approach as a potential solution to these problems. We introduce Neural Probabilistic Soft Logic Dialogue Structure Induction (NEUPSL DSI), a principled approach that injects symbolic knowledge into the latent space of a generative neural model. We conduct a thorough empirical investigation on the effect of NEUPSL DSI learning on hidden representation quality, few-shot learning, and out-of-domain generalization performance. Over three dialog structure induction datasets and across unsupervised and semi-supervised settings for standard and cross-domain generalization, the injection of symbolic knowledge using NEUPSL DSI provides a consistent boost in performance over the canonical baselines.
☆ ArabicaQA: A Comprehensive Dataset for Arabic Question Answering SIGIR 2024
In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels of open-domain questions marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs in Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly accessible for further research https://github.com/DataScienceUIBK/ArabicaQA.
comment: Accepted at SIGIR 2024
☆ Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-storage environments. We provide code and trial video data at http://hovsg.github.io/.
comment: Code and video are available at http://hovsg.github.io/
☆ Graph Language Model (GLM): A new graph-based approach to detect social instabilities
This scientific report presents a novel methodology for the early prediction of important political events using News datasets. The methodology leverages natural language processing, graph theory, clique analysis, and semantic relationships to uncover hidden predictive signals within the data. Initially, we designed a preliminary version of the method and tested it on a few events. This analysis revealed limitations in the initial research phase. We then enhanced the model in two key ways: first, we added a filtration step to only consider politically relevant news before further processing; second, we adjusted the input features to make the alert system more sensitive to significant spikes in the data. After finalizing the improved methodology, we tested it on eleven events including US protests, the Ukraine war, and French protests. Results demonstrate the superiority of our approach compared to baseline methods. Through targeted refinements, our model can now provide earlier and more accurate predictions of major political events based on subtle patterns in news data.
☆ Are Compressed Language Models Less Subgroup Robust? EMNLP 2023
To reduce the inference cost of large language models, model compression is increasingly used to create smaller scalable models. However, little is known about their robustness to minority subgroups defined by the labels and attributes of a dataset. In this paper, we investigate the effects of 18 different compression methods and settings on the subgroup robustness of BERT language models. We show that worst-group performance does not depend on model size alone, but also on the compression method used. Additionally, we find that model compression does not always worsen the performance on minority subgroups. Altogether, our analysis serves to further research into the subgroup robustness of model compression.
comment: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
☆ Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Many recent language model (LM) interpretability studies have adopted the circuits framework, which aims to find the minimal computational subgraph, or circuit, that explains LM behavior on a given task. Most studies determine which edges belong in a LM's circuit by performing causal interventions on each edge independently, but this scales poorly with model size. Edge attribution patching (EAP), gradient-based approximation to interventions, has emerged as a scalable but imperfect solution to this problem. In this paper, we introduce a new method - EAP with integrated gradients (EAP-IG) - that aims to better maintain a core property of circuits: faithfulness. A circuit is faithful if all model edges outside the circuit can be ablated without changing the model's performance on the task; faithfulness is what justifies studying circuits, rather than the full model. Our experiments demonstrate that circuits found using EAP are less faithful than those found using EAP-IG, even though both have high node overlap with circuits found previously using causal interventions. We conclude more generally that when using circuits to compare the mechanisms models use to solve tasks, faithfulness, not overlap, is what should be measured.
☆ Improving Text-to-Image Consistency via Automatic Prompt Optimization
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
☆ SciNews: From Scholarly Complexities to Public Narratives -- A Dataset for Scientific News Report Generation LREC
Scientific news reports serve as a bridge, adeptly translating complex research articles into reports that resonate with the broader public. The automated generation of such narratives enhances the accessibility of scholarly insights. In this paper, we present a new corpus to facilitate this paradigm development. Our corpus comprises a parallel compilation of academic publications and their corresponding scientific news reports across nine disciplines. To demonstrate the utility and reliability of our dataset, we conduct an extensive analysis, highlighting the divergences in readability and brevity between scientific news narratives and academic manuscripts. We benchmark our dataset employing state-of-the-art text generation models. The evaluation process involves both automatic and human evaluation, which lays the groundwork for future explorations into the automated generation of scientific news reports. The dataset and code related to this work are available at https://dongqi.me/projects/SciNews.
comment: LREC-COLING 2024 Main Conference Paper
☆ Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons LREC
In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions with three classes of adjectives which cannot be distinguished by surface features. This enables us to probe for LLM's understanding of these constructions in various ways, and we find that they fail in a variety of ways to distinguish between them, suggesting that they don't adequately represent their meaning or capture the lexical properties of phrasal heads.
comment: LREC-COLING 2024
☆ Can multiple-choice questions really be useful in detecting the abilities of LLMs?
Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs) due to their simplicity and efficiency. However, there are concerns about whether MCQs can truly measure LLM's capabilities, particularly in knowledge-intensive scenarios where long-form generation (LFG) answers are required. The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQ's efficacy, which we undertake in this paper by evaluating nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English. We identify a significant issue: LLMs exhibit an order sensitivity in bilingual MCQs, favoring answers located at specific positions, i.e., the first position. We further quantify the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings. Our results reveal a relatively low correlation between answers from MCQs and LFGQs for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs' output, which can be generalized to other QA evaluation benchmarks. Notably, our analysis challenges the idea that the higher the consistency, the greater the accuracy. We also find MCQs to be less reliable than LFGQs in terms of expected calibration error. Finally, the misalignment between MCQs and LFGQs is not only reflected in the evaluation performance but also in the embedding space. Our code and models can be accessed at https://github.com/Meetyou-AI-Lab/Can-MC-Evaluate-LLMs.
☆ UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies LREC
The Universal Dependencies (UD) project has created an invaluable collection of treebanks with contributions in over 140 languages. However, the UD annotations do not tell the full story. Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements -- for example, interrogative sentences with special markers and/or word orders -- are not labeled holistically. We argue for (i) augmenting UD annotations with a 'UCxn' annotation layer for such meaning-bearing grammatical constructions, and (ii) approaching this in a typologically informed way so that morphosyntactic strategies can be compared across languages. As a case study, we consider five construction families in ten languages, identifying instances of each construction in UD treebanks through the use of morphosyntactic patterns. In addition to findings regarding these particular constructions, our study yields important insights on methodology for describing and identifying constructions in language-general and language-particular ways, and lays the foundation for future constructional enrichment of UD treebanks.
comment: LREC-COLING 2024
☆ Continual Few-shot Event Detection via Hierarchical Augmentation Networks LREC
Traditional continual event detection relies on abundant labeled data for training, which is often impractical to obtain in real-world applications. In this paper, we introduce continual few-shot event detection (CFED), a more commonly encountered scenario when a substantial number of labeled samples are not accessible. The CFED task is challenging as it involves memorizing previous event types and learning new event types with few-shot samples. To mitigate these challenges, we propose a memory-based framework: Hierarchical Augmentation Networks (HANet). To memorize previous event types with limited memory, we incorporate prototypical augmentation into the memory set. For the issue of learning new event types in few-shot scenarios, we propose a contrastive augmentation module for token representations. Despite comparing with previous state-of-the-art methods, we also conduct comparisons with ChatGPT. Experiment results demonstrate that our method significantly outperforms all of these methods in multiple continual few-shot event detection tasks.
comment: Accepted to LREC-COLING 2024
☆ FastPerson: Enhancing Video Learning through Effective Video Summarization that Preserves Linguistic and Visual Contexts
Quickly understanding lengthy lecture videos is essential for learners with limited time and interest in various topics to improve their learning efficiency. To this end, video summarization has been actively researched to enable users to view only important scenes from a video. However, these studies focus on either the visual or audio information of a video and extract important segments in the video. Therefore, there is a risk of missing important information when both the teacher's speech and visual information on the blackboard or slides are important, such as in a lecture video. To tackle this issue, we propose FastPerson, a video summarization approach that considers both the visual and auditory information in lecture videos. FastPerson creates summary videos by utilizing audio transcriptions along with on-screen images and text, minimizing the risk of overlooking crucial information for learners. Further, it provides a feature that allows learners to switch between the summary and original videos for each chapter of the video, enabling them to adjust the pace of learning based on their interests and level of understanding. We conducted an evaluation with 40 participants to assess the effectiveness of our method and confirmed that it reduced viewing time by 53\% at the same level of comprehension as that when using traditional video playback methods.
☆ Enhanced Short Text Modeling: Leveraging Large Language Models for Topic Refinement
Crafting effective topic models for brief texts, like tweets and news headlines, is essential for capturing the swift shifts in social dynamics. Traditional topic models, however, often fall short in accurately representing the semantic intricacies of short texts due to their brevity and lack of contextual data. In our study, we harness the advanced capabilities of Large Language Models (LLMs) to introduce a novel approach termed "Topic Refinement". This approach does not directly involve itself in the initial modeling of topics but focuses on improving topics after they have been mined. By employing prompt engineering, we direct LLMs to eliminate off-topic words within a given topic, ensuring that only contextually relevant words are preserved or substituted with ones that fit better semantically. This method emulates human-like scrutiny and improvement of topics, thereby elevating the semantic quality of the topics generated by various models. Our comprehensive evaluation across three unique datasets has shown that our topic refinement approach significantly enhances the semantic coherence of topics.
comment: 6 pages, 4 figures
☆ Not All Similarities Are Created Equal: Leveraging Data-Driven Biases to Inform GenAI Copyright Disputes
The advent of Generative Artificial Intelligence (GenAI) models, including GitHub Copilot, OpenAI GPT, and Stable Diffusion, has revolutionized content creation, enabling non-professionals to produce high-quality content across various domains. This transformative technology has led to a surge of synthetic content and sparked legal disputes over copyright infringement. To address these challenges, this paper introduces a novel approach that leverages the learning capacity of GenAI models for copyright legal analysis, demonstrated with GPT2 and Stable Diffusion models. Copyright law distinguishes between original expressions and generic ones (Sc\`enes \`a faire), protecting the former and permitting reproduction of the latter. However, this distinction has historically been challenging to make consistently, leading to over-protection of copyrighted works. GenAI offers an unprecedented opportunity to enhance this legal analysis by revealing shared patterns in preexisting works. We propose a data-driven approach to identify the genericity of works created by GenAI, employing "data-driven bias" to assess the genericity of expressive compositions. This approach aids in copyright scope determination by utilizing the capabilities of GenAI to identify and prioritize expressive elements and rank them according to their frequency in the model's dataset. The potential implications of measuring expressive genericity for copyright law are profound. Such scoring could assist courts in determining copyright scope during litigation, inform the registration practices of Copyright Offices, allowing registration of only highly original synthetic works, and help copyright owners signal the value of their works and facilitate fairer licensing deals. More generally, this approach offers valuable insights to policymakers grappling with adapting copyright law to the challenges posed by the era of GenAI.
comment: Presented at ACM CSLAW 2024
☆ Language Models for Text Classification: Is In-Context Learning Enough? LREC
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches based on fine-tuning is the ability to understand instructions written in natural language (prompts), which helps them generalise better to different tasks and domains without the need for specific training data. This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances. However, existing research is limited in scale and lacks understanding of how text generation models combined with prompting techniques compare to more established methods for text classification such as fine-tuning masked language models. In this paper, we address this research gap by performing a large-scale evaluation study for 16 text classification datasets covering binary, multiclass, and multilabel problems. In particular, we compare zero- and few-shot approaches of large language models to fine-tuning smaller language models. We also analyse the results by prompt, classification type, domain, and number of labels. In general, the results show how fine-tuning smaller and more efficient language models can still outperform few-shot approaches of larger language models, which have room for improvement when it comes to text classification.
comment: Accepted at LREC-COLING 2024
☆ Intrinsic Subgraph Generation for Interpretable Graph based Visual Question Answering LREC
The large success of deep learning based methods in Visual Question Answering (VQA) has concurrently increased the demand for explainable methods. Most methods in Explainable Artificial Intelligence (XAI) focus on generating post-hoc explanations rather than taking an intrinsic approach, the latter characterizing an interpretable model. In this work, we introduce an interpretable approach for graph-based VQA and demonstrate competitive performance on the GQA dataset. This approach bridges the gap between interpretability and performance. Our model is designed to intrinsically produce a subgraph during the question-answering process as its explanation, providing insight into the decision making. To evaluate the quality of these generated subgraphs, we compare them against established post-hoc explainability methods for graph neural networks, and perform a human evaluation. Moreover, we present quantitative metrics that correlate with the evaluations of human assessors, acting as automatic metrics for the generated explanatory subgraphs. Our implementation is available at https://github.com/DigitalPhonetics/Intrinsic-Subgraph-Generation-for-VQA.
comment: Accepted at LREC-COLING 2024
☆ DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition
End-to-end automatic speech recognition (E2E ASR) systems often suffer from mistranscription of domain-specific phrases, such as named entities, sometimes leading to catastrophic failures in downstream tasks. A family of fast and lightweight named entity correction (NEC) models for ASR have recently been proposed, which normally build on phonetic-level edit distance algorithms and have shown impressive NEC performance. However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we proposed a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information to facilitate mitigation of phonetic confusion for NEC on ASR transcription. To this end, an efficient entity description augmented masked language model (EDA-MLM) comprised of a dense retrieval model is introduced, enabling MLM to adapt swiftly to domain-specific entities for the NEC task. A series of experiments conducted on the AISHELL-1 and Homophone datasets confirm the effectiveness of our modeling approach. DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), by a character error rate (CER) reduction of about 7% relatively on AISHELL-1 for named entities. More notably, when tested on Homophone that contain named entities of high phonetic confusion, DANCER offers a more pronounced CER reduction of 46% relatively over PED-NEC for named entities.
☆ REFeREE: A REference-FREE Model-Based Metric for Text Simplification LREC
Text simplification lacks a universal standard of quality, and annotated reference simplifications are scarce and costly. We propose to alleviate such limitations by introducing REFeREE, a reference-free model-based metric with a 3-stage curriculum. REFeREE leverages an arbitrarily scalable pretraining stage and can be applied to any quality standard as long as a small number of human annotations are available. Our experiments show that our metric outperforms existing reference-based metrics in predicting overall ratings and reaches competitive and consistent performance in predicting specific ratings while requiring no reference simplifications at inference time.
comment: Accepted at LREC-COLING 2024
☆ Mix-Initiative Response Generation with Dynamic Prefix Tuning NAACL 2024
Mixed initiative serves as one of the key factors in controlling conversation directions. For a speaker, responding passively or leading proactively would result in rather different responses. However, most dialogue systems focus on training a holistic response generation model without any distinction among different initiatives. It leads to the cross-contamination problem, where the model confuses different initiatives and generates inappropriate responses. Moreover, obtaining plenty of human annotations for initiative labels can be expensive. To address this issue, we propose a general mix-Initiative Dynamic Prefix Tuning framework (IDPT) to decouple different initiatives from the generation model, which learns initiative-aware prefixes in both supervised and unsupervised settings. Specifically, IDPT decouples initiative factors into different prefix parameters and uses the attention mechanism to adjust the selection of initiatives in guiding generation dynamically. The prefix parameters can be tuned towards accurate initiative prediction as well as mix-initiative response generation. Extensive experiments on two public dialogue datasets show that the proposed IDPT outperforms previous baselines on both automatic metrics and human evaluations. It also manages to generate appropriate responses with manipulated initiatives.
comment: Accepted to the main conference of NAACL 2024
☆ "You are an expert annotator": Automatic Best-Worst-Scaling Annotations for Emotion Intensity Modeling NAACL 2024
Labeling corpora constitutes a bottleneck to create models for new tasks or domains. Large language models mitigate the issue with automatic corpus labeling methods, particularly for categorical annotations. Some NLP tasks such as emotion intensity prediction, however, require text regression, but there is no work on automating annotations for continuous label assignments. Regression is considered more challenging than classification: The fact that humans perform worse when tasked to choose values from a rating scale lead to comparative annotation methods, including best-worst scaling. This raises the question if large language model-based annotation methods show similar patterns, namely that they perform worse on rating scale annotation tasks than on comparative annotation tasks. To study this, we automate emotion intensity predictions and compare direct rating scale predictions, pairwise comparisons and best-worst scaling. We find that the latter shows the highest reliability. A transformer regressor fine-tuned on these data performs nearly on par with a model trained on the original manual annotations.
comment: accepted for publication in NAACL 2024
☆ Denoising Table-Text Retrieval for Open-Domain Question Answering LREC
In table-text open-domain question answering, a retriever system retrieves relevant evidence from tables and text to answer questions. Previous studies in table-text open-domain question answering have two common challenges: firstly, their retrievers can be affected by false-positive labels in training datasets; secondly, they may struggle to provide appropriate evidence for questions that require reasoning across the table. To address these issues, we propose Denoised Table-Text Retriever (DoTTeR). Our approach involves utilizing a denoised training dataset with fewer false positive labels by discarding instances with lower question-relevance scores measured through a false positive detection model. Subsequently, we integrate table-level ranking information into the retriever to assist in finding evidence for questions that demand reasoning across the table. To encode this ranking information, we fine-tune a rank-aware column encoder to identify minimum and maximum values within a column. Experimental results demonstrate that DoTTeR significantly outperforms strong baselines on both retrieval recall and downstream QA tasks. Our code is available at https://github.com/deokhk/DoTTeR.
comment: Accepted to LREC-COLING 2024
☆ Coimagining the Future of Voice Assistants with Cultural Sensitivity
Voice assistants (VAs) are becoming a feature of our everyday life. Yet, the user experience (UX) is often limited, leading to underuse, disengagement, and abandonment. Co-designing interactions for VAs with potential end-users can be useful. Crowdsourcing this process online and anonymously may add value. However, most work has been done in the English-speaking West on dialogue data sets. We must be sensitive to cultural differences in language, social interactions, and attitudes towards technology. Our aims were to explore the value of co-designing VAs in the non-Western context of Japan and demonstrate the necessity of cultural sensitivity. We conducted an online elicitation study (N = 135) where Americans (n = 64) and Japanese people (n = 71) imagined dialogues (N = 282) and activities (N = 73) with future VAs. We discuss the implications for coimagining interactions with future VAs, offer design guidelines for the Japanese and English-speaking US contexts, and suggest opportunities for cultural plurality in VA design and scholarship.
comment: 21 pages
☆ Towards a Zero-Data, Controllable, Adaptive Dialog System
Conversational Tree Search (V\"ath et al., 2023) is a recent approach to controllable dialog systems, where domain experts shape the behavior of a Reinforcement Learning agent through a dialog tree. The agent learns to efficiently navigate this tree, while adapting to information needs, e.g., domain familiarity, of different users. However, the need for additional training data hinders deployment in new domains. To address this, we explore approaches to generate this data directly from dialog trees. We improve the original approach, and show that agents trained on synthetic data can achieve comparable dialog success to models trained on human data, both when using a commercial Large Language Model for generation, or when using a smaller open-source model, running on a single GPU. We further demonstrate the scalability of our approach by collecting and testing on two new datasets: ONBOARD, a new domain helping foreign residents moving to a new city, and the medical domain DIAGNOSE, a subset of Wikipedia articles related to scalp and head symptoms. Finally, we perform human testing, where no statistically significant differences were found in either objective or subjective measures between models trained on human and generated data.
☆ Task-Oriented Paraphrase Analytics LREC
Since paraphrasing is an ill-defined task, the term "paraphrasing" covers text transformation tasks with different characteristics. Consequently, existing paraphrasing studies have applied quite different (explicit and implicit) criteria as to when a pair of texts is to be considered a paraphrase, all of which amount to postulating a certain level of semantic or lexical similarity. In this paper, we conduct a literature review and propose a taxonomy to organize the 25~identified paraphrasing (sub-)tasks. Using classifiers trained to identify the tasks that a given paraphrasing instance fits, we find that the distributions of task-specific instances in the known paraphrase corpora vary substantially. This means that the use of these corpora, without the respective paraphrase conditions being clearly defined (which is the normal case), must lead to incomparable and misleading results.
comment: Accepted at LREC-COLING 2024
☆ m3P: Towards Multimodal Multilingual Translation with Multimodal Prompt COLING 2024
Multilingual translation supports multiple translation directions by projecting all languages in a shared space, but the translation quality is undermined by the difference between languages in the text-only modality, especially when the number of languages is large. To bridge this gap, we introduce visual context as the universal language-independent representation to facilitate multilingual translation. In this paper, we propose a framework to leverage the multimodal prompt to guide the Multimodal Multilingual neural Machine Translation (m3P), which aligns the representations of different languages with the same meaning and generates the conditional vision-language memory for translation. We construct a multilingual multimodal instruction dataset (InstrMulti102) to support 102 languages. Our method aims to minimize the representation distance of different languages by regarding the image as a central language. Experimental results show that m3P outperforms previous text-only baselines and multilingual multimodal methods by a large margin. Furthermore, the probing experiments validate the effectiveness of our method in enhancing translation under the low-resource and massively multilingual scenario.
comment: COLING 2024
☆ RuBia: A Russian Language Bias Detection Dataset LREC
Warning: this work contains upsetting or disturbing content. Large language models (LLMs) tend to learn the social and cultural biases present in the raw pre-training data. To test if an LLM's behavior is fair, functional datasets are employed, and due to their purpose, these datasets are highly language and culture-specific. In this paper, we address a gap in the scope of multilingual bias evaluation by presenting a bias detection dataset specifically designed for the Russian language, dubbed as RuBia. The RuBia dataset is divided into 4 domains: gender, nationality, socio-economic status, and diverse, each of the domains is further divided into multiple fine-grained subdomains. Every example in the dataset consists of two sentences with the first reinforcing a potentially harmful stereotype or trope and the second contradicting it. These sentence pairs were first written by volunteers and then validated by native-speaking crowdsourcing workers. Overall, there are nearly 2,000 unique sentence pairs spread over 19 subdomains in RuBia. To illustrate the dataset's purpose, we conduct a diagnostic evaluation of state-of-the-art or near-state-of-the-art LLMs and discuss the LLMs' predisposition to social biases.
comment: accepted to LREC-COLING 2024
☆ Naive Bayes-based Context Extension for Large Language Models NAACL 2024
Large Language Models (LLMs) have shown promising in-context learning abilities. However, conventional In-Context Learning (ICL) approaches are often impeded by length limitations of transformer architecture, which pose challenges when attempting to effectively integrate supervision from a substantial number of demonstration examples. In this paper, we introduce a novel framework, called Naive Bayes-based Context Extension (NBCE), to enable existing LLMs to perform ICL with an increased number of demonstrations by significantly expanding their context size. Importantly, this expansion does not require fine-tuning or dependence on particular model architectures, all the while preserving linear efficiency. NBCE initially splits the context into equal-sized windows fitting the target LLM's maximum length. Then, it introduces a voting mechanism to select the most relevant window, regarded as the posterior context. Finally, it employs Bayes' theorem to generate the test task. Our experimental results demonstrate that NBCE substantially enhances performance, particularly as the number of demonstration examples increases, consistently outperforming alternative methods. The NBCE code will be made publicly accessible. The code NBCE is available at: https://github.com/amurtadha/NBCE-master
comment: Accepted to main NAACL 2024
☆ Decoding excellence: Mapping the demand for psychological traits of operations and supply chain professionals through text mining
The current study proposes an innovative methodology for the profiling of psychological traits of Operations Management (OM) and Supply Chain Management (SCM) professionals. We use innovative methods and tools of text mining and social network analysis to map the demand for relevant skills from a set of job descriptions, with a focus on psychological characteristics. The proposed approach aims to evaluate the market demand for specific traits by combining relevant psychological constructs, text mining techniques, and an innovative measure, namely, the Semantic Brand Score. We apply the proposed methodology to a dataset of job descriptions for OM and SCM professionals, with the objective of providing a mapping of their relevant required skills, including psychological characteristics. In addition, the analysis is then detailed by considering the region of the organization that issues the job description, its organizational size, and the seniority level of the open position in order to understand their nuances. Finally, topic modeling is used to examine key components and their relative significance in job descriptions. By employing a novel methodology and considering contextual factors, we provide an innovative understanding of the attitudinal traits that differentiate professionals. This research contributes to talent management, recruitment practices, and professional development initiatives, since it provides new figures and perspectives to improve the effectiveness and success of Operations Management and Supply Chain Management professionals.
☆ A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions LREC
Situated conversations, which refer to visual information as visual question answering (VQA), often contain ambiguities caused by reliance on directive information. This problem is exacerbated because some languages, such as Japanese, often omit subjective or objective terms. Such ambiguities in questions are often clarified by the contexts in conversational situations, such as joint attention with a user or user gaze information. In this study, we propose the Gaze-grounded VQA dataset (GazeVQA) that clarifies ambiguous questions using gaze information by focusing on a clarification process complemented by gaze information. We also propose a method that utilizes gaze target estimation results to improve the accuracy of GazeVQA tasks. Our experimental results showed that the proposed method improved the performance in some cases of a VQA system on GazeVQA and identified some typical problems of GazeVQA tasks that need to be improved.
comment: LREC-COLING 2024
☆ Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction
Large Language Models (LLMs) have been reported to outperform existing automatic evaluation metrics in some tasks, such as text summarization and machine translation. However, there has been a lack of research on LLMs as evaluators in grammatical error correction (GEC). In this study, we investigate the performance of LLMs in GEC evaluation by employing prompts designed to incorporate various evaluation criteria inspired by previous research. Our extensive experimental results demonstrate that GPT-4 achieved Kendall's rank correlation of 0.662 with human judgments, surpassing all existing methods. Furthermore, in recent GEC evaluations, we have underscored the significance of the LLMs scale and particularly emphasized the importance of fluency among evaluation criteria.
☆ ILLUMINER: Instruction-tuned Large Language Models as Few-shot Intent Classifier and Slot Filler LREC
State-of-the-art intent classification (IC) and slot filling (SF) methods often rely on data-intensive deep learning models, limiting their practicality for industry applications. Large language models on the other hand, particularly instruction-tuned models (Instruct-LLMs), exhibit remarkable zero-shot performance across various natural language tasks. This study evaluates Instruct-LLMs on popular benchmark datasets for IC and SF, emphasizing their capacity to learn from fewer examples. We introduce ILLUMINER, an approach framing IC and SF as language generation tasks for Instruct-LLMs, with a more efficient SF-prompting method compared to prior work. A comprehensive comparison with multiple baselines shows that our approach, using the FLAN-T5 11B model, outperforms the state-of-the-art joint IC+SF method and in-context learning with GPT3.5 (175B), particularly in slot filling by 11.1--32.2 percentage points. Additionally, our in-depth ablation study demonstrates that parameter-efficient fine-tuning requires less than 6% of training data to yield comparable performance with traditional full-weight fine-tuning.
comment: Accepted at LREC-COLING 2024
☆ Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks LREC
Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. As for formal grammars, they can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic grammar rules from treebanks, in order to create an easy-to-understand corpus-based grammar. More specifically, we extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order, using a large search space and paying special attention to the ranking order of the extracted rules. For that, we use a linear classifier to extract the most salient features that predict the linguistic phenomena under study. We associate statistical information to each rule, and we compare the ranking of the model's results to those of other quantitative and statistical measures. Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.
comment: Published in LREC-Coling 2024 proceedings
☆ Multilingual Sentence-T5: Scalable Sentence Encoders for Multilingual Applications LREC
Prior work on multilingual sentence embedding has demonstrated that the efficient use of natural language inference (NLI) data to build high-performance models can outperform conventional methods. However, the potential benefits from the recent ``exponential'' growth of language models with billions of parameters have not yet been fully explored. In this paper, we introduce Multilingual Sentence T5 (m-ST5), as a larger model of NLI-based multilingual sentence embedding, by extending Sentence T5, an existing monolingual model. By employing the low-rank adaptation (LoRA) technique, we have achieved a successful scaling of the model's size to 5.7 billion parameters. We conducted experiments to evaluate the performance of sentence embedding and verified that the method outperforms the NLI-based prior approach. Furthermore, we also have confirmed a positive correlation between the size of the model and its performance. It was particularly noteworthy that languages with fewer resources or those with less linguistic similarity to English benefited more from the parameter increase. Our model is available at https://huggingface.co/pkshatech/m-ST5.
comment: Accepted in LREC-COLING 2024
☆ Provably Secure Disambiguating Neural Linguistic Steganography
Recent research in provably secure neural linguistic steganography has overlooked a crucial aspect: the sender must detokenize stegotexts to avoid raising suspicion from the eavesdropper. The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures in all neural language steganography implementations based on these models. Current solutions to this issue involve altering the probability distribution of candidate words, rendering them incompatible with provably secure steganography. We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem. We group all tokens with prefix relationships in the candidate pool before the steganographic embedding algorithm runs to eliminate uncertainty among ambiguous tokens. To enable the receiver to synchronize the sampling process of the sender, a shared cryptographically-secure pseudorandom number generator (CSPRNG) is deployed to select a token from the ambiguity pool. SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods. We provide theoretical proofs and experimentally demonstrate the applicability of our solution to various languages and models, showing its potential to significantly improve the reliability and security of neural linguistic steganography systems.
☆ MapGuide: A Simple yet Effective Method to Reconstruct Continuous Language from Brain Activities NAACL 2024
Decoding continuous language from brain activity is a formidable yet promising field of research. It is particularly significant for aiding people with speech disabilities to communicate through brain signals. This field addresses the complex task of mapping brain signals to text. The previous best attempt reverse-engineered this process in an indirect way: it began by learning to encode brain activity from text and then guided text generation by aligning with predicted brain responses. In contrast, we propose a simple yet effective method that guides text reconstruction by directly comparing them with the predicted text embeddings mapped from brain activities. Comprehensive experiments reveal that our method significantly outperforms the current state-of-the-art model, showing average improvements of 77% and 54% on BLEU and METEOR scores. We further validate the proposed modules through detailed ablation studies and case analyses and highlight a critical correlation: the more precisely we map brain activities to text embeddings, the better the text reconstruction results. Such insight can simplify the task of reconstructing language from brain activities for future work, emphasizing the importance of improving brain-to-text-embedding mapping techniques.
comment: Accepted to NAACL 2024 main conference
☆ Sharing the Cost of Success: A Game for Evaluating and Learning Collaborative Multi-Agent Instruction Giving and Following Policies LREC
In collaborative goal-oriented settings, the participants are not only interested in achieving a successful outcome, but do also implicitly negotiate the effort they put into the interaction (by adapting to each other). In this work, we propose a challenging interactive reference game that requires two players to coordinate on vision and language observations. The learning signal in this game is a score (given after playing) that takes into account the achieved goal and the players' assumed efforts during the interaction. We show that a standard Proximal Policy Optimization (PPO) setup achieves a high success rate when bootstrapped with heuristic partner behaviors that implement insights from the analysis of human-human interactions. And we find that a pairing of neural partners indeed reduces the measured joint effort when playing together repeatedly. However, we observe that in comparison to a reasonable heuristic pairing there is still room for improvement -- which invites further research in the direction of cost-sharing in collaborative interactions.
comment: 9 pages, Accepted at LREC-COLING 2024
☆ DGoT: Dynamic Graph of Thoughts for Scientific Abstract Generation LREC
The method of training language models based on domain datasets has obtained significant achievements in the task of generating scientific paper abstracts. However, such models face problems of generalization and expensive training costs. The use of large language models (LLMs) to solve the task of generating paper abstracts saves the cost of model training. However, due to the hallucination problem of LLM, it is often necessary to improve the reliability of the results through multi-round query prompt approach such as Graph of Thoughts (GoT), which also brings additional reasoning costs. In this paper, we propose a Dynamic Graph of Thought (DGoT). It not only inherits the advantages of the existing GoT prompt approach, but also dynamically adjust the graph structure according to data characteristics while reducing model reasoning cost. Experimental results show that our method's cost-effectiveness in abstract generation tasks is only 43.7% to 56.4% of other multi-round query prompt approaches. Our code is available at https://github.com/JayceNing/DGoT.
comment: Accepted by LREC-COLING 2024
☆ KDMCSE: Knowledge Distillation Multimodal Sentence Embeddings with Adaptive Angular margin Contrastive Learning NAACL 2024
Previous work on multimodal sentence embedding has proposed multimodal contrastive learning and achieved promising results. However, by taking the rest of the batch as negative samples without reviewing when forming contrastive pairs, those studies encountered many suspicious and noisy negative examples, significantly affecting the methods' overall performance. In this work, we propose KDMCSE (Knowledge Distillation Multimodal contrastive learning of Sentence Embeddings), a novel approach that enhances the discrimination and generalizability of multimodal representation and inherits the knowledge from the teacher model to learn the difference between positive and negative instances and via that, can detect noisy and wrong negative samples effectively before they are calculated in the contrastive objective. Furthermore, to overcome the limitation of modeling the variation within negative pairs, we introduce a new contrastive objective, AdapACSE (Adaptive Angular Margin Supervised Contrastive Learning for Multimodal sentence embeddings), that enhances the discriminative representation by strengthening the margin within the angular space while capturing varying semantics within the negative. Experimental results on widely used Semantic Textual Similarity (STS) benchmarks demonstrate the effectiveness of our approach.
comment: Accepted to NAACL 2024
☆ Incorporating Exponential Smoothing into MLP: A Simple but Effective Sequence Model
Modeling long-range dependencies in sequential data is a crucial step in sequence learning. A recently developed model, the Structured State Space (S4), demonstrated significant effectiveness in modeling long-range sequences. However, It is unclear whether the success of S4 can be attributed to its intricate parameterization and HiPPO initialization or simply due to State Space Models (SSMs). To further investigate the potential of the deep SSMs, we start with exponential smoothing (ETS), a simple SSM, and propose a stacked architecture by directly incorporating it into an element-wise MLP. We augment simple ETS with additional parameters and complex field to reduce the inductive bias. Despite increasing less than 1\% of parameters of element-wise MLP, our models achieve comparable results to S4 on the LRA benchmark.
comment: 12 pages, 5 tables, 3 figures
☆ Robust and Scalable Model Editing for Large Language Models LREC
Large language models (LLMs) can make predictions using parametric knowledge--knowledge encoded in the model weights--or contextual knowledge--knowledge presented in the context. In many scenarios, a desirable behavior is that LLMs give precedence to contextual knowledge when it conflicts with the parametric knowledge, and fall back to using their parametric knowledge when the context is irrelevant. This enables updating and correcting the model's knowledge by in-context editing instead of retraining. Previous works have shown that LLMs are inclined to ignore contextual knowledge and fail to reliably fall back to parametric knowledge when presented with irrelevant context. In this work, we discover that, with proper prompting methods, instruction-finetuned LLMs can be highly controllable by contextual knowledge and robust to irrelevant context. Utilizing this feature, we propose EREN (Edit models by REading Notes) to improve the scalability and robustness of LLM editing. To better evaluate the robustness of model editors, we collect a new dataset, that contains irrelevant questions that are more challenging than the ones in existing datasets. Empirical results show that our method outperforms current state-of-the-art methods by a large margin. Unlike existing techniques, it can integrate knowledge from multiple edits, and correctly respond to syntactically similar but semantically unrelated inputs (and vice versa). The source code can be found at https://github.com/thunlp/EREN.
comment: LREC-COLING 2024 paper, 16 pages, 4 figures
☆ Aligning Large Language Models for Enhancing Psychiatric Interviews through Symptom Delineation and Summarization
Recent advancements in Large Language Models (LLMs) have accelerated their usage in various domains. Given the fact that psychiatric interviews are goal-oriented and structured dialogues between the professional interviewer and the interviewee, it is one of the most underexplored areas where LLMs can contribute substantial value. Here, we explore the use of LLMs for enhancing psychiatric interviews, by analyzing counseling data from North Korean defectors with traumatic events and mental health issues. Specifically, we investigate whether LLMs can (1) delineate the part of the conversation that suggests psychiatric symptoms and name the symptoms, and (2) summarize stressors and symptoms, based on the interview dialogue transcript. Here, the transcript data was labeled by mental health experts for training and evaluation of LLMs. Our experimental results show that appropriately prompted LLMs can achieve high performance on both the symptom delineation task and the summarization task. This research contributes to the nascent field of applying LLMs to psychiatric interview and demonstrates their potential effectiveness in aiding mental health practitioners.
☆ LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction COLING 2024
Over-correction is a critical problem in Chinese grammatical error correction (CGEC) task. Recent work using model ensemble methods based on voting can effectively mitigate over-correction and improve the precision of the GEC system. However, these methods still require the output of several GEC systems and inevitably lead to reduced error recall. In this light, we propose the LM-Combiner, a rewriting model that can directly modify the over-correction of GEC system outputs without a model ensemble. Specifically, we train the model on an over-correction dataset constructed through the proposed K-fold cross inference method, which allows it to directly generate filtered sentences by combining the original and the over-corrected text. In the inference stage, we directly take the original sentences and the output results of other systems as input and then obtain the filtered sentences through LM-Combiner. Experiments on the FCGEC dataset show that our proposed method effectively alleviates the over-correction of the original system (+18.2 Precision) while ensuring the error recall remains unchanged. Besides, we find that LM-Combiner still has a good rewriting performance even with small parameters and few training data, and thus can cost-effectively mitigate the over-correction of black-box GEC systems (e.g., ChatGPT).
comment: Accepted to COLING 2024
☆ PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models
Prompt compression is an innovative method for efficiently condensing input prompts while preserving essential information. To facilitate quick-start services, user-friendly interfaces, and compatibility with common datasets and metrics, we present the Prompt Compression Toolkit (PCToolkit). This toolkit is a unified plug-and-play solution for compressing prompts in Large Language Models (LLMs), featuring cutting-edge prompt compressors, diverse datasets, and metrics for comprehensive performance evaluation. PCToolkit boasts a modular design, allowing for easy integration of new datasets and metrics through portable and user-friendly interfaces. In this paper, we outline the key components and functionalities of PCToolkit. We conducted evaluations of the compressors within PCToolkit across various natural language tasks, including reconstruction, summarization, mathematical problem-solving, question answering, few-shot learning, synthetic tasks, code completion, boolean expressions, multiple choice questions, and lies recognition.
comment: For open-source repository, see https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression
☆ Transcribing Bengali Text with Regional Dialects to IPA using District Guided Tokens
Accurate transcription of Bengali text to the International Phonetic Alphabet (IPA) is a challenging task due to the complex phonology of the language and context-dependent sound changes. This challenge is even more for regional Bengali dialects due to unavailability of standardized spelling conventions for these dialects, presence of local and foreign words popular in those regions and phonological diversity across different regions. This paper presents an approach to this sequence-to-sequence problem by introducing the District Guided Tokens (DGT) technique on a new dataset spanning six districts of Bangladesh. The key idea is to provide the model with explicit information about the regional dialect or "district" of the input text before generating the IPA transcription. This is achieved by prepending a district token to the input sequence, effectively guiding the model to understand the unique phonetic patterns associated with each district. The DGT technique is applied to fine-tune several transformer-based models, on this new dataset. Experimental results demonstrate the effectiveness of DGT, with the ByT5 model achieving superior performance over word-based models like mT5, BanglaT5, and umT5. This is attributed to ByT5's ability to handle a high percentage of out-of-vocabulary words in the test set. The proposed approach highlights the importance of incorporating regional dialect information into ubiquitous natural language processing systems for languages with diverse phonological variations. The following work was a result of the "Bhashamul" challenge, which is dedicated to solving the problem of Bengali text with regional dialects to IPA transcription https://www.kaggle.com/competitions/regipa/. The training and inference notebooks are available through the competition link.
comment: This work became the champion of the Bhashamul challenge
☆ ELLEN: Extremely Lightly Supervised Learning For Efficient Named Entity Recognition LREC
In this work, we revisit the problem of semi-supervised named entity recognition (NER) focusing on extremely light supervision, consisting of a lexicon containing only 10 examples per class. We introduce ELLEN, a simple, fully modular, neuro-symbolic method that blends fine-tuned language models with linguistic rules. These rules include insights such as ''One Sense Per Discourse'', using a Masked Language Model as an unsupervised NER, leveraging part-of-speech tags to identify and eliminate unlabeled entities as false negatives, and other intuitions about classifier confidence scores in local and global context. ELLEN achieves very strong performance on the CoNLL-2003 dataset when using the minimal supervision from the lexicon above. It also outperforms most existing (and considerably more complex) semi-supervised NER methods under the same supervision settings commonly used in the literature (i.e., 5% of the training data). Further, we evaluate our CoNLL-2003 model in a zero-shot scenario on WNUT-17 where we find that it outperforms GPT-3.5 and achieves comparable performance to GPT-4. In a zero-shot setting, ELLEN also achieves over 75% of the performance of a strong, fully supervised model trained on gold data. Our code is available at: https://github.com/hriaz17/ELLEN.
comment: Accepted to LREC-COLING 2024
☆ ChatGPT Rates Natural Language Explanation Quality Like Humans: But on Which Scales? LREC
As AI becomes more integral in our lives, the need for transparency and responsibility grows. While natural language explanations (NLEs) are vital for clarifying the reasoning behind AI decisions, evaluating them through human judgments is complex and resource-intensive due to subjectivity and the need for fine-grained ratings. This study explores the alignment between ChatGPT and human assessments across multiple scales (i.e., binary, ternary, and 7-Likert scale). We sample 300 data instances from three NLE datasets and collect 900 human annotations for both informativeness and clarity scores as the text quality measurement. We further conduct paired comparison experiments under different ranges of subjectivity scores, where the baseline comes from 8,346 human annotations. Our results show that ChatGPT aligns better with humans in more coarse-grained scales. Also, paired comparisons and dynamic prompting (i.e., providing semantically similar examples in the prompt) improve the alignment. This research advances our understanding of large language models' capabilities to assess the text explanation quality in different configurations for responsible AI development.
comment: Accpeted by LREC-COLING 2024 main conference, long paper
☆ Extracting Biomedical Entities from Noisy Audio Transcripts LREC
Automatic Speech Recognition (ASR) technology is fundamental in transcribing spoken language into text, with considerable applications in the clinical realm, including streamlining medical transcription and integrating with Electronic Health Record (EHR) systems. Nevertheless, challenges persist, especially when transcriptions contain noise, leading to significant drops in performance when Natural Language Processing (NLP) models are applied. Named Entity Recognition (NER), an essential clinical task, is particularly affected by such noise, often termed the ASR-NLP gap. Prior works have primarily studied ASR's efficiency in clean recordings, leaving a research gap concerning the performance in noisy environments. This paper introduces a novel dataset, BioASR-NER, designed to bridge the ASR-NLP gap in the biomedical domain, focusing on extracting adverse drug reactions and mentions of entities from the Brief Test of Adult Cognition by Telephone (BTACT) exam. Our dataset offers a comprehensive collection of almost 2,000 clean and noisy recordings. In addressing the noise challenge, we present an innovative transcript-cleaning method using GPT4, investigating both zero-shot and few-shot methodologies. Our study further delves into an error analysis, shedding light on the types of errors in transcription software, corrections by GPT4, and the challenges GPT4 faces. This paper aims to foster improved understanding and potential solutions for the ASR-NLP gap, ultimately supporting enhanced healthcare documentation practices.
comment: Accepted to LREC-COLING 2024
☆ Bridging Textual and Tabular Worlds for Fact Verification: A Lightweight, Attention-Based Model LREC
FEVEROUS is a benchmark and research initiative focused on fact extraction and verification tasks involving unstructured text and structured tabular data. In FEVEROUS, existing works often rely on extensive preprocessing and utilize rule-based transformations of data, leading to potential context loss or misleading encodings. This paper introduces a simple yet powerful model that nullifies the need for modality conversion, thereby preserving the original evidence's context. By leveraging pre-trained models on diverse text and tabular datasets and by incorporating a lightweight attention-based mechanism, our approach efficiently exploits latent connections between different data types, thereby yielding comprehensive and reliable verdict predictions. The model's modular structure adeptly manages multi-modal information, ensuring the integrity and authenticity of the original evidence are uncompromised. Comparative analyses reveal that our approach exhibits competitive performance, aligning itself closely with top-tier models on the FEVEROUS benchmark.
comment: Accepted for a presentation at LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
☆ Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models
We present a Chain-of-Action (CoA) framework for multimodal and retrieval-augmented Question-Answering (QA). Compared to the literature, CoA overcomes two major challenges of current QA applications: (i) unfaithful hallucination that is inconsistent with real-time or domain facts and (ii) weak reasoning performance over compositional information. Our key contribution is a novel reasoning-retrieval mechanism that decomposes a complex question into a reasoning chain via systematic prompting and pre-designed actions. Methodologically, we propose three types of domain-adaptable `Plug-and-Play' actions for retrieving real-time information from heterogeneous sources. We also propose a multi-reference faith score (MRFS) to verify and resolve conflicts in the answers. Empirically, we exploit both public benchmarks and a Web3 case study to demonstrate the capability of CoA over other methods.
☆ Disambiguate Entity Matching through Relation Discovery with Large Language Models
Entity matching is a critical challenge in data integration and cleaning, central to tasks like fuzzy joins and deduplication. Traditional approaches have focused on overcoming fuzzy term representations through methods such as edit distance, Jaccard similarity, and more recently, embeddings and deep neural networks, including advancements from large language models (LLMs) like GPT. However, the core challenge in entity matching extends beyond term fuzziness to the ambiguity in defining what constitutes a "match," especially when integrating with external databases. This ambiguity arises due to varying levels of detail and granularity among entities, complicating exact matches. We propose a novel approach that shifts focus from purely identifying semantic similarities to understanding and defining the "relations" between entities as crucial for resolving ambiguities in matching. By predefining a set of relations relevant to the task at hand, our method allows analysts to navigate the spectrum of similarity more effectively, from exact matches to conceptually related entities.
☆ Language Models are Free Boosters for Biomedical Imaging Tasks
In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of language or textual data. The approach diverges from established methodologies by utilizing a frozen transformer block, extracted from pre-trained LLMs, as an innovative encoder layer for the direct processing of visual tokens. This strategy represents a significant departure from the standard multi-modal vision-language frameworks, which typically hinge on language-driven prompts and inputs. We found that these LLMs could boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks, serving as plug-and-play boosters. More interestingly, as a byproduct, we found that the proposed framework achieved superior performance, setting new state-of-the-art results on extensive, standardized datasets in MedMNIST-2D and 3D. Through this work, we aim to open new avenues for employing LLMs in biomedical imaging and enriching the understanding of their potential in this specialized domain.
☆ Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models USENIX Security 2024
Recent advancements in generative AI have enabled ubiquitous access to large language models (LLMs). Empowered by their exceptional capabilities to understand and generate human-like text, these models are being increasingly integrated into our society. At the same time, there are also concerns on the potential misuse of this powerful technology, prompting defensive measures from service providers. To overcome such protection, jailbreaking prompts have recently emerged as one of the most effective mechanisms to circumvent security restrictions and elicit harmful content originally designed to be prohibited. Due to the rapid development of LLMs and their ease of access via natural languages, the frontline of jailbreak prompts is largely seen in online forums and among hobbyists. To gain a better understanding of the threat landscape of semantically meaningful jailbreak prompts, we systemized existing prompts and measured their jailbreak effectiveness empirically. Further, we conducted a user study involving 92 participants with diverse backgrounds to unveil the process of manually creating jailbreak prompts. We observed that users often succeeded in jailbreak prompts generation regardless of their expertise in LLMs. Building on the insights from the user study, we also developed a system using AI as the assistant to automate the process of jailbreak prompt generation.
comment: Accepted by USENIX Security 2024
☆ JMultiWOZ: A Large-Scale Japanese Multi-Domain Task-Oriented Dialogue Dataset LREC
Dialogue datasets are crucial for deep learning-based task-oriented dialogue system research. While numerous English language multi-domain task-oriented dialogue datasets have been developed and contributed to significant advancements in task-oriented dialogue systems, such a dataset does not exist in Japanese, and research in this area is limited compared to that in English. In this study, towards the advancement of research and development of task-oriented dialogue systems in Japanese, we constructed JMultiWOZ, the first Japanese language large-scale multi-domain task-oriented dialogue dataset. Using JMultiWOZ, we evaluated the dialogue state tracking and response generation capabilities of the state-of-the-art methods on the existing major English benchmark dataset MultiWOZ2.2 and the latest large language model (LLM)-based methods. Our evaluation results demonstrated that JMultiWOZ provides a benchmark that is on par with MultiWOZ2.2. In addition, through evaluation experiments of interactive dialogues with the models and human participants, we identified limitations in the task completion capabilities of LLMs in Japanese.
comment: Accepted by LREC-COLING 2024
☆ Project MOSLA: Recording Every Moment of Second Language Acquisition LREC
Second language acquisition (SLA) is a complex and dynamic process. Many SLA studies that have attempted to record and analyze this process have typically focused on a single modality (e.g., textual output of learners), covered only a short period of time, and/or lacked control (e.g., failed to capture every aspect of the learning process). In Project MOSLA (Moments of Second Language Acquisition), we have created a longitudinal, multimodal, multilingual, and controlled dataset by inviting participants to learn one of three target languages (Arabic, Spanish, and Chinese) from scratch over a span of two years, exclusively through online instruction, and recording every lesson using Zoom. The dataset is semi-automatically annotated with speaker/language IDs and transcripts by both human annotators and fine-tuned state-of-the-art speech models. Our experiments reveal linguistic insights into learners' proficiency development over time, as well as the potential for automatically detecting the areas of focus on the screen purely from the unannotated multimodal data. Our dataset is freely available for research purposes and can serve as a valuable resource for a wide range of applications, including but not limited to SLA, proficiency assessment, language and speech processing, pedagogy, and multimodal learning analytics.
comment: Accepted at LREC-COLING 2024
☆ Neural Multimodal Topic Modeling: A Comprehensive Evaluation LREC
Neural topic models can successfully find coherent and diverse topics in textual data. However, they are limited in dealing with multimodal datasets (e.g., images and text). This paper presents the first systematic and comprehensive evaluation of multimodal topic modeling of documents containing both text and images. In the process, we propose two novel topic modeling solutions and two novel evaluation metrics. Overall, our evaluation on an unprecedented rich and diverse collection of datasets indicates that both of our models generate coherent and diverse topics. Nevertheless, the extent to which one method outperforms the other depends on the metrics and dataset combinations, which suggests further exploration of hybrid solutions in the future. Notably, our succinct human evaluation aligns with the outcomes determined by our proposed metrics. This alignment not only reinforces the credibility of our metrics but also highlights the potential for their application in guiding future multimodal topic modeling endeavors.
comment: Camera-Ready for LREC-COLING 2024 (Long Paper)
☆ HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification NAACL 2024
Existing self-supervised methods in natural language processing (NLP), especially hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning, extremely relying on human-designed augmentation rules to generate contrastive samples, which can potentially corrupt or distort the original information. In this paper, we tend to investigate the feasibility of a contrastive learning scheme in which the semantic and syntactic information inherent in the input sample is adequately reserved in the contrastive samples and fused during the learning process. Specifically, we propose an information lossless contrastive learning strategy for HTC, namely \textbf{H}ierarchy-aware \textbf{I}nformation \textbf{L}ossless contrastive \textbf{L}earning (HILL), which consists of a text encoder representing the input document, and a structure encoder directly generating the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information inherent in the label hierarchy with the principle of structural entropy minimization, and injects the syntactic information into the text representation via hierarchical representation learning. Experiments on three common datasets are conducted to verify the superiority of HILL.
comment: Accepted by NAACL 2024
☆ Decoding Probing: Revealing Internal Linguistic Structures in Neural Language Models using Minimal Pairs LREC
Inspired by cognitive neuroscience studies, we introduce a novel `decoding probing' method that uses minimal pairs benchmark (BLiMP) to probe internal linguistic characteristics in neural language models layer by layer. By treating the language model as the `brain' and its representations as `neural activations', we decode grammaticality labels of minimal pairs from the intermediate layers' representations. This approach reveals: 1) Self-supervised language models capture abstract linguistic structures in intermediate layers that GloVe and RNN language models cannot learn. 2) Information about syntactic grammaticality is robustly captured through the first third layers of GPT-2 and also distributed in later layers. As sentence complexity increases, more layers are required for learning grammatical capabilities. 3) Morphological and semantics/syntax interface-related features are harder to capture than syntax. 4) For Transformer-based models, both embeddings and attentions capture grammatical features but show distinct patterns. Different attention heads exhibit similar tendencies toward various linguistic phenomena, but with varied contributions.
comment: Accepted by LREC-COLING 2024
☆ InternLM2 Technical Report
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k ``Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
☆ Common Ground Tracking in Multimodal Dialogue
Within Dialogue Modeling research in AI and NLP, considerable attention has been spent on ``dialogue state tracking'' (DST), which is the ability to update the representations of the speaker's needs at each turn in the dialogue by taking into account the past dialogue moves and history. Less studied but just as important to dialogue modeling, however, is ``common ground tracking'' (CGT), which identifies the shared belief space held by all of the participants in a task-oriented dialogue: the task-relevant propositions all participants accept as true. In this paper we present a method for automatically identifying the current set of shared beliefs and ``questions under discussion'' (QUDs) of a group with a shared goal. We annotate a dataset of multimodal interactions in a shared physical space with speech transcriptions, prosodic features, gestures, actions, and facets of collaboration, and operationalize these features for use in a deep neural model to predict moves toward construction of common ground. Model outputs cascade into a set of formal closure rules derived from situated evidence and belief axioms and update operations. We empirically assess the contribution of each feature type toward successful construction of common ground relative to ground truth, establishing a benchmark in this novel, challenging task.
☆ Automate Knowledge Concept Tagging on Math Questions with LLMs
Knowledge concept tagging for questions plays a crucial role in contemporary intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations have been conducted manually with help from pedagogical experts, as the task requires not only a strong semantic understanding of both question stems and knowledge definitions but also deep insights into connecting question-solving logic with corresponding knowledge concepts. In this paper, we explore automating the tagging task using Large Language Models (LLMs), in response to the inability of prior manual methods to meet the rapidly growing demand for concept tagging in questions posed by advanced educational applications. Moreover, the zero/few-shot learning capability of LLMs makes them well-suited for application in educational scenarios, which often face challenges in collecting large-scale, expertise-annotated datasets. By conducting extensive experiments with a variety of representative LLMs, we demonstrate that LLMs are a promising tool for concept tagging in math questions. Furthermore, through case studies examining the results from different LLMs, we draw some empirical conclusions about the key factors for success in applying LLMs to the automatic concept tagging task.
comment: 7 pages, 2 figures
♻ ☆ LocalTweets to LocalHealth: A Mental Health Surveillance Framework Based on Twitter Data
Prior research on Twitter (now X) data has provided positive evidence of its utility in developing supplementary health surveillance systems. In this study, we present a new framework to surveil public health, focusing on mental health (MH) outcomes. We hypothesize that locally posted tweets are indicative of local MH outcomes and collect tweets posted from 765 neighborhoods (census block groups) in the USA. We pair these tweets from each neighborhood with the corresponding MH outcome reported by the Center for Disease Control (CDC) to create a benchmark dataset, LocalTweets. With LocalTweets, we present the first population-level evaluation task for Twitter-based MH surveillance systems. We then develop an efficient and effective method, LocalHealth, for predicting MH outcomes based on LocalTweets. When used with GPT3.5, LocalHealth achieves the highest F1-score and accuracy of 0.7429 and 79.78\%, respectively, a 59\% improvement in F1-score over the GPT3.5 in zero-shot setting. We also utilize LocalHealth to extrapolate CDC's estimates to proxy unreported neighborhoods, achieving an F1-score of 0.7291. Our work suggests that Twitter data can be effectively leveraged to simulate neighborhood-level MH outcomes.
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
♻ ☆ Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling LREC
Topic modelling, as a well-established unsupervised technique, has found extensive use in automatically detecting significant topics within a corpus of documents. However, classic topic modelling approaches (e.g., LDA) have certain drawbacks, such as the lack of semantic understanding and the presence of overlapping topics. In this work, we investigate the untapped potential of large language models (LLMs) as an alternative for uncovering the underlying topics within extensive text corpora. To this end, we introduce a framework that prompts LLMs to generate topics from a given set of documents and establish evaluation protocols to assess the clustering efficacy of LLMs. Our findings indicate that LLMs with appropriate prompts can stand out as a viable alternative, capable of generating relevant topic titles and adhering to human guidelines to refine and merge topics. Through in-depth experiments and evaluation, we summarise the advantages and constraints of employing LLMs in topic extraction.
comment: Accepted at LREC-COLING 2024
♻ ☆ AI and Generative AI for Research Discovery and Summarization
AI and generative AI tools, including chatbots like ChatGPT that rely on large language models (LLMs), have burst onto the scene this year, creating incredible opportunities to increase work productivity and improve our lives. Statisticians and data scientists have begun experiencing the benefits from the availability of these tools in numerous ways, such as the generation of programming code from text prompts to analyze data or fit statistical models. One area that these tools can make a substantial impact is in research discovery and summarization. Standalone tools and plugins to chatbots are being developed that allow researchers to more quickly find relevant literature than pre-2023 search tools. Furthermore, generative AI tools have improved to the point where they can summarize and extract the key points from research articles in succinct language. Finally, chatbots based on highly parameterized LLMs can be used to simulate abductive reasoning, which provides researchers the ability to make connections among related technical topics, which can also be used for research discovery. We review the developments in AI and generative AI for research discovery and summarization, and propose directions where these types of tools are likely to head in the future that may be of interest to statistician and data scientists.
comment: 29 pages, 9 figures
♻ ☆ Generator-Retriever-Generator Approach for Open-Domain Question Answering
Open-domain question answering (QA) tasks usually require the retrieval of relevant information from a large corpus to generate accurate answers. We propose a novel approach called Generator-Retriever-Generator (GRG) that combines document retrieval techniques with a large language model (LLM), by first prompting the model to generate contextual documents based on a given question. In parallel, a dual-encoder network retrieves documents that are relevant to the question from an external corpus. The generated and retrieved documents are then passed to the second LLM, which generates the final answer. By combining document retrieval and LLM generation, our approach addresses the challenges of open-domain QA, such as generating informative and contextually relevant answers. GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines (GENREAD and RFiD) improving their performance by at least by +5.2, +4.2, and +1.6 on TriviaQA, NQ, and WebQ datasets, respectively. We provide code, datasets, and checkpoints at https://github.com/abdoelsayed2016/GRG.
♻ ☆ ChatGPT Needs SPADE (Sustainability, PrivAcy, Digital divide, and Ethics) Evaluation: A Review
ChatGPT is another large language model (LLM) vastly available for the consumers on their devices but due to its performance and ability to converse effectively, it has gained a huge popularity amongst research as well as industrial community. Recently, many studies have been published to show the effectiveness, efficiency, integration, and sentiments of chatGPT and other LLMs. In contrast, this study focuses on the important aspects that are mostly overlooked, i.e. sustainability, privacy, digital divide, and ethics and suggests that not only chatGPT but every subsequent entry in the category of conversational bots should undergo Sustainability, PrivAcy, Digital divide, and Ethics (SPADE) evaluation. This paper discusses in detail the issues and concerns raised over chatGPT in line with aforementioned characteristics. We also discuss the recent EU AI Act briefly in accordance with the SPADE evaluation. We support our hypothesis by some preliminary data collection and visualizations along with hypothesized facts. We also suggest mitigations and recommendations for each of the concerns. Furthermore, we also suggest some policies and recommendations for AI policy act, if designed by the governments.
comment: 29 pages, 8 figures, 4 tables
♻ ☆ AMuRD: Annotated Arabic-English Receipt Dataset for Key Information Extraction and Classification
The extraction of key information from receipts is a complex task that involves the recognition and extraction of text from scanned receipts. This process is crucial as it enables the retrieval of essential content and organizing it into structured documents for easy access and analysis. In this paper, we present AMuRD, a novel multilingual human-annotated dataset specifically designed for information extraction from receipts. This dataset comprises $47,720$ samples and addresses the key challenges in information extraction and item classification - the two critical aspects of data analysis in the retail industry. Each sample includes annotations for item names and attributes such as price, brand, and more. This detailed annotation facilitates a comprehensive understanding of each item on the receipt. Furthermore, the dataset provides classification into $44$ distinct product categories. This classification feature allows for a more organized and efficient analysis of the items, enhancing the usability of the dataset for various applications. In our study, we evaluated various language model architectures, e.g., by fine-tuning LLaMA models on the AMuRD dataset. Our approach yielded exceptional results, with an F1 score of 97.43\% and accuracy of 94.99\% in information extraction and classification, and an even higher F1 score of 98.51\% and accuracy of 97.06\% observed in specific tasks. The dataset and code are publicly accessible for further researchhttps://github.com/Update-For-Integrated-Business-AI/AMuRD.
♻ ☆ Training BERT Models to Carry Over a Coding System Developed on One Corpus to Another LREC
This paper describes how we train BERT models to carry over a coding system developed on the paragraphs of a Hungarian literary journal to another. The aim of the coding system is to track trends in the perception of literary translation around the political transformation in 1989 in Hungary. To evaluate not only task performance but also the consistence of the annotation, moreover, to get better predictions from an ensemble, we use 10-fold crossvalidation. Extensive hyperparameter tuning is used to obtain the best possible results and fair comparisons. To handle label imbalance, we use loss functions and metrics robust to it. Evaluation of the effect of domain shift is carried out by sampling a test set from the target domain. We establish the sample size by estimating the bootstrapped confidence interval via simulations. This way, we show that our models can carry over one annotation system to the target domain. Comparisons are drawn to provide insights such as learning multilabel correlations and confidence penalty improve resistance to domain shift, and domain adaptation on OCR-ed text on another domain improves performance almost to the same extent as that on the corpus under study. See our code at https://codeberg.org/zsamboki/bert-annotator-ensemble.
comment: Camera-ready version, to be presented at the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
♻ ☆ Efficient Pre-training for Localized Instruction Generation of Videos
Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than current web-scale datasets, enables efficient training of large-scale models with competitive performance. We complement our Sieve-\&-Swap approach with a Procedure Transformer (ProcX) for end-to-end step localization and instruction generation for procedural videos. When this model is pre-trained on our curated dataset, it achieves state-of-the-art performance in zero-shot and finetuning settings on YouCook2 and Tasty, while using a fraction of the computational resources.
comment: This version has some missing experiments and elaborative technical details
♻ ☆ Blinded by Generated Contexts: How Language Models Merge Generated and Retrieved Contexts for Open-Domain QA?
While auxiliary information has become a key to enhancing Large Language Models (LLMs), relatively little is known about how LLMs merge these contexts, specifically contexts generated by LLMs and those retrieved from external sources. To investigate this, we formulate a systematic framework to identify whether LLMs' responses, derived from the integration of generated and retrieved contexts, are attributed to either generated or retrieved contexts. To easily trace the origin of the response, we construct datasets with conflicting contexts, i.e., each question is paired with both generated and retrieved contexts, yet only one of them contains the correct answer. Our experiments reveal a significant bias in several LLMs (GPT-4/3.5 and Llama2) to favor generated contexts, even when they provide incorrect information. We further identify two key factors contributing to this bias: i) contexts generated by LLMs typically show greater similarity to the questions, increasing their likelihood of being selected; ii) the segmentation process used in retrieved contexts disrupts their completeness, thereby hindering their full utilization in LLMs. Our analysis enhances the understanding of how LLMs merge diverse contexts, offering valuable insights for advancing current augmentation methods for LLMs.
♻ ☆ Measuring Entrainment in Spontaneous Code-switched Speech NAACL 2024
It is well-known that speakers who entrain to one another have more successful conversations than those who do not. Previous research has shown that interlocutors entrain on linguistic features in both written and spoken monolingual domains. More recent work on code-switched communication has also shown preliminary evidence of entrainment on certain aspects of code-switching (CSW). However, such studies of entrainment in code-switched domains have been extremely few and restricted to human-machine textual interactions. Our work studies code-switched spontaneous speech between humans, finding that (1) patterns of written and spoken entrainment in monolingual settings largely generalize to code-switched settings, and (2) some patterns of entrainment on code-switching in dialogue agent-generated text generalize to spontaneous code-switched speech. Our findings give rise to important implications for the potentially "universal" nature of entrainment as a communication phenomenon, and potential applications in inclusive and interactive speech technology.
comment: Edits: camera-ready manuscript for NAACL 2024
♻ ☆ Decode Neural signal as Speech
Decoding language from brain dynamics is an important open direction in the realm of brain-computer interface (BCI), especially considering the rapid growth of large language models. Compared to invasive-based signals which require electrode implantation surgery, non-invasive neural signals (e.g. EEG, MEG) have attracted increasing attention considering their safety and generality. However, the exploration is not adequate in three aspects: 1) previous methods mainly focus on EEG but none of the previous works address this problem on MEG with better signal quality; 2) prior works have predominantly used ``teacher-forcing" during generative decoding, which is impractical; 3) prior works are mostly ``BART-based" not fully auto-regressive, which performs better in other sequence tasks. In this paper, we explore the brain-to-text translation of MEG signals in a speech-decoding formation. Here we are the first to investigate a cross-attention-based ``whisper" model for generating text directly from MEG signals without teacher forcing. Our model achieves impressive BLEU-1 scores of 60.30 and 52.89 without pretraining \& teacher-forcing on two major datasets (\textit{GWilliams} and \textit{Schoffelen}). This paper conducts a comprehensive review to understand how speech decoding formation performs on the neural decoding tasks, including pretraining initialization, training \& evaluation set splitting, augmentation, and scaling law.
♻ ☆ Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions. Current methods fall short of accurately identifying and mitigating these hallucinations. To address this issue, we introduce ESREAL, a novel unsupervised learning framework designed to suppress the generation of hallucinations through accurate localization and penalization of hallucinated tokens. Initially, ESREAL creates a reconstructed image based on the generated caption and aligns its corresponding regions with those of the original image. This semantic reconstruction aids in identifying both the presence and type of token-level hallucinations within the generated caption. Subsequently, ESREAL computes token-level hallucination scores by assessing the semantic similarity of aligned regions based on the type of hallucination. Finally, ESREAL employs a proximal policy optimization algorithm, where it selectively penalizes hallucinated tokens according to their token-level hallucination scores. Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46% on the CHAIR metric. This improvement is achieved solely through signals derived from the image itself, without the need for any image-text pairs.
♻ ☆ Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning
Leveraging generative Artificial Intelligence (AI), we have transformed a dataset comprising 1,000 scientific papers into an ontological knowledge graph. Through an in-depth structural analysis, we have calculated node degrees, identified communities and connectivities, and evaluated clustering coefficients and betweenness centrality of pivotal nodes, uncovering fascinating knowledge architectures. The graph has an inherently scale-free nature, is highly connected, and can be used for graph reasoning by taking advantage of transitive and isomorphic properties that reveal unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, propose never-before-seen material designs, and predict material behaviors. We compute deep node embeddings for combinatorial node similarity ranking for use in a path sampling strategy links dissimilar concepts that have previously not been related. One comparison revealed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. In another example, the algorithm proposed a hierarchical mycelium-based composite based on integrating path sampling with principles extracted from Kandinsky's 'Composition VII' painting. The resulting material integrates an innovative set of concepts that include a balance of chaos/order, adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across science, technology and art, revealing a nuanced ontology of immanence that reveal a context-dependent heterarchical interplay of constituents. Graph-based generative AI achieves a far higher degree of novelty, explorative capacity, and technical detail, than conventional approaches and establishes a widely useful framework for innovation by revealing hidden connections.
♻ ☆ Unveiling the Pitfalls of Knowledge Editing for Large Language Models ICLR 2024
As the cost associated with fine-tuning Large Language Models (LLMs) continues to rise, recent research efforts have pivoted towards developing methodologies to edit implicit knowledge embedded within LLMs. Yet, there's still a dark cloud lingering overhead -- will knowledge editing trigger butterfly effect? since it is still unclear whether knowledge editing might introduce side effects that pose potential risks or not. This paper pioneers the investigation into the potential pitfalls associated with knowledge editing for LLMs. To achieve this, we introduce new benchmark datasets and propose innovative evaluation metrics. Our results underline two pivotal concerns: (1) Knowledge Conflict: Editing groups of facts that logically clash can magnify the inherent inconsistencies in LLMs-a facet neglected by previous methods. (2) Knowledge Distortion: Altering parameters with the aim of editing factual knowledge can irrevocably warp the innate knowledge structure of LLMs. Experimental results vividly demonstrate that knowledge editing might inadvertently cast a shadow of unintended consequences on LLMs, which warrant attention and efforts for future works. Code and data are available at https://github.com/zjunlp/PitfallsKnowledgeEditing.
comment: ICLR 2024
♻ ☆ CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation LREC
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. The comparability of the corpora is ensured by a comparable crawling setup and the usage of identical crawling and post-processing technology. All the corpora were linguistically annotated with the state-of-the-art CLASSLA-Stanza linguistic processing pipeline, and enriched with document-level genre information via the Transformer-based multilingual X-GENRE classifier, which further enhances comparability at the level of linguistic annotation and metadata enrichment. The genre-focused analysis of the resulting corpora shows a rather consistent distribution of genres throughout the seven corpora, with variations in the most prominent genre categories being well-explained by the economic strength of each language community. A comparison of the distribution of genre categories across the corpora indicates that web corpora from less developed countries primarily consist of news articles. Conversely, web corpora from economically more developed countries exhibit a smaller proportion of news content, with a greater presence of promotional and opinionated texts.
comment: Accepted to the LREC-COLING 2024 conference
♻ ☆ Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration NAACL 2024
Human intelligence thrives on cognitive synergy, where collaboration among different minds yield superior outcomes compared to isolated individuals. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist is an intelligent agent that collaboratively combines multiple minds' strengths and knowledge to enhance problem-solving in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. Our in-depth analysis shows that assigning multiple fine-grained personas in LLMs improves problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, experimental results demonstrate that SPP effectively reduces factual hallucination, and maintains strong reasoning capabilities. Additionally, comparative experiments show that cognitive synergy only emerges in GPT-4 and does not appear in less capable models, such as GPT-3.5-turbo and Llama2-13b-chat, which draws an interesting analogy to human development. Code, data, and prompts can be found at: https://github.com/MikeWangWZHL/Solo-Performance-Prompting.git.
comment: Accepted as a main conference paper at NAACL 2024
♻ ☆ Exploring Representational Disparities Between Multilingual and Bilingual Translation Models LREC
Multilingual machine translation has proven immensely useful for both parameter efficiency and overall performance across many language pairs via complete multilingual parameter sharing. However, some language pairs in multilingual models can see worse performance than in bilingual models, especially in the one-to-many translation setting. Motivated by their empirical differences, we examine the geometric differences in representations from bilingual models versus those from one-to-many multilingual models. Specifically, we compute the isotropy of these representations using intrinsic dimensionality and IsoScore, in order to measure how the representations utilize the dimensions in their underlying vector space. Using the same evaluation data in both models, we find that for a given language pair, its multilingual model decoder representations are consistently less isotropic and occupy fewer dimensions than comparable bilingual model decoder representations. Additionally, we show that much of the anisotropy in multilingual decoder representations can be attributed to modeling language-specific information, therefore limiting remaining representational capacity.
comment: LREC-COLING 2024
♻ ☆ FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction
Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization.
comment: 9 pages, long paper
♻ ☆ Coarse-Tuning for Ad-hoc Document Retrieval Using Pre-trained Language Models LREC
Fine-tuning in information retrieval systems using pre-trained language models (PLM-based IR) requires learning query representations and query-document relations, in addition to downstream task-specific learning. This study introduces coarse-tuning as an intermediate learning stage that bridges pre-training and fine-tuning. By learning query representations and query-document relations in coarse-tuning, we aim to reduce the load of fine-tuning and improve the learning effect of downstream IR tasks. We propose Query-Document Pair Prediction (QDPP) for coarse-tuning, which predicts the appropriateness of query-document pairs. Evaluation experiments show that the proposed method significantly improves MRR and/or nDCG@5 in four ad-hoc document retrieval datasets. Furthermore, the results of the query prediction task suggested that coarse-tuning facilitated learning of query representation and query-document relations.
comment: Accepted at LREC-COLING 2024
♻ ☆ EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation LREC
Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassing a wide array of scripts, and are imbued with profound religious and cultural significance. This paper introduces EthioLLM -- multilingual large language models for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark -- a new benchmark dataset for various downstream NLP tasks. We evaluate the performance of these models across five downstream NLP tasks. We open-source our multilingual language models, new benchmark datasets for various downstream tasks, and task-specific fine-tuned language models and discuss the performance of the models. Our dataset and models are available at the https://huggingface.co/EthioNLP repository.
comment: Accepted at LREC-Coling 2024
♻ ☆ A Design Space for Intelligent and Interactive Writing Assistants CHI 2024
In our era of rapid technological advancement, the research landscape for writing assistants has become increasingly fragmented across various research communities. We seek to address this challenge by proposing a design space as a structured way to examine and explore the multidimensional space of intelligent and interactive writing assistants. Through a large community collaboration, we explore five aspects of writing assistants: task, user, technology, interaction, and ecosystem. Within each aspect, we define dimensions (i.e., fundamental components of an aspect) and codes (i.e., potential options for each dimension) by systematically reviewing 115 papers. Our design space aims to offer researchers and designers a practical tool to navigate, comprehend, and compare the various possibilities of writing assistants, and aid in the envisioning and design of new writing assistants.
comment: Published as a conference paper at CHI 2024
♻ ☆ BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service LREC
Since the Internet is flooded with hate, it is one of the main tasks for NLP experts to master automated online content moderation. However, advancements in this field require improved access to publicly available accurate and non-synthetic datasets of social media content. For the Polish language, such resources are very limited. In this paper, we address this gap by presenting a new open dataset of offensive social media content for the Polish language. The dataset comprises content from Wykop.pl, a popular online service often referred to as the "Polish Reddit", reported by users and banned in the internal moderation process. It contains a total of 691,662 posts and comments, evenly divided into two categories: "harmful" and "neutral" ("non-harmful"). The anonymized subset of the BAN-PL dataset consisting on 24,000 pieces (12,000 for each class), along with preprocessing scripts have been made publicly available. Furthermore the paper offers valuable insights into real-life content moderation processes and delves into an analysis of linguistic features and content characteristics of the dataset. Moreover, a comprehensive anonymization procedure has been meticulously described and applied. The prevalent biases encountered in similar datasets, including post-moderation and pre-selection biases, are also discussed.
comment: Accepted for LREC-COLING 2024 Conference
♻ ☆ Hyacinth6B: A large language model for Traditional Chinese
This research's primary motivation of this study is to address the high hardware and computational demands typically associated with LLMs.Therefore,our goal is to find a balance between model lightness and performance,striving to maximize performance while using a comparatively lightweight model. Hyacinth6B was developed with this objective in mind,aiming to fully leverage the core capabilities of LLMs without incurring substantial resource costs, effectively pushing the boundaries of smaller model's performance. The training approach involves parameter efficient finetuning using the LoRA method.
comment: 14pages
♻ ☆ Large Language Model for Multi-objective Evolutionary Optimization
Multiobjective evolutionary algorithms (MOEAs) are major methods for solving multiobjective optimization problems (MOPs). Many MOEAs have been proposed in the past decades, of which the search operators need a carefully handcrafted design with domain knowledge. Recently, some attempts have been made to replace the manually designed operators in MOEAs with learning-based operators (e.g., neural network models). However, much effort is still required for designing and training such models, and the learned operators might not generalize well on new problems. To tackle the above challenges, this work investigates a novel approach that leverages the powerful large language model (LLM) to design MOEA operators. With proper prompt engineering, we successfully let a general LLM serve as a black-box search operator for decomposition-based MOEA (MOEA/D) in a zero-shot manner. In addition, by learning from the LLM behavior, we further design an explicit white-box operator with randomness and propose a new version of decomposition-based MOEA, termed MOEA/D-LO. Experimental studies on different test benchmarks show that our proposed method can achieve competitive performance with widely used MOEAs. It is also promising to see the operator only learned from a few instances can have robust generalization performance on unseen problems with quite different patterns and settings. The results reveal the potential benefits of using pre-trained LLMs in the design of MOEAs.To foster reproducibility and accessibility, the source code is https://github.com/FeiLiu36/LLM4MOEA.
♻ ☆ COPR: Continual Learning Human Preference through Optimal Policy Regularization
The technique of Reinforcement Learning from Human Feedback (RLHF) is a commonly employed method to improve pre-trained Language Models (LM), enhancing their ability to conform to human preferences. Nevertheless, the current RLHF-based LMs necessitate full retraining each time novel queries or feedback are introduced, which becomes a challenging task because human preferences can vary between different domains or tasks. Retraining LMs poses practical difficulties in many real-world situations due to the significant time and computational resources required, along with concerns related to data privacy. To address this limitation, we propose a new method called Continual Optimal Policy Regularization (COPR), in which we compute the distribution of optimal policy bypassing the partition function and then regularize the current policy based on the historically optimal distribution to mitigate Catastrophic Forgetting (CF). COPR involves a single learning phase and doesn't necessitate complex reinforcement learning. Importantly, it shares the capability with RLHF to learn from unlabeled data by maintaining a scoring module, similar to reward model, making it flexible for continually learning without human feedback. Our experimental results show that COPR outperforms strong Continuous Learning (CL) baselines when it comes to consistently aligning with human preferences on incremental tasks and domains.
♻ ☆ Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark LREC
Topic segmentation and outline generation strive to divide a document into coherent topic sections and generate corresponding subheadings, unveiling the discourse topic structure of a document. Compared with sentence-level topic structure, the paragraph-level topic structure can quickly grasp and understand the overall context of the document from a higher level, benefitting many downstream tasks such as summarization, discourse parsing, and information retrieval. However, the lack of large-scale, high-quality Chinese paragraph-level topic structure corpora restrained relative research and applications. To fill this gap, we build the Chinese paragraph-level topic representation, corpus, and benchmark in this paper. Firstly, we propose a hierarchical paragraph-level topic structure representation with three layers to guide the corpus construction. Then, we employ a two-stage man-machine collaborative annotation method to construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), achieving high quality. We also build several strong baselines, including ChatGPT, to validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation) and preliminarily verified its usefulness for the downstream task (discourse parsing).
comment: Accepted by LREC-COLING 2024
♻ ☆ Spanish Resource Grammar version 2023
We present the latest version of the Spanish Resource Grammar (SRG), a grammar of Spanish implemented in the HPSG formalism. Such grammars encode a complex set of hypotheses about syntax making them a resource for empirical testing of linguistic theory. They also encode a strict notion of grammaticality which makes them a resource for natural language processing applications in computer-assisted language learning. This version of the SRG uses the recent version of the Freeling morphological analyzer and is released along with an automatically created, manually verified treebank of 2,291 sentences. We explain the treebanking process, emphasizing how it is different from treebanking with manual annotation and how it contributes to empirically-driven development of syntactic theory. The treebanks' high level of consistency and detail makes them a resource for training high-quality semantic parsers and generally systems that benefit from precise and detailed semantics. Finally, we present the grammar's coverage and overgeneration on 100 sentences from a learner corpus, a new research line related to developing methodologies for robust empirical evaluation of hypotheses in second language acquisition.
comment: 12 pages, 5 figures
♻ ☆ Motion Generation from Fine-grained Textual Descriptions
The task of text2motion is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., "A man squats.", fine-grained descriptions specifying movements of relevant body parts are barely explored. Models trained with coarse-grained texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure to generate motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with step-by-step instructions with pseudo-code compulsory checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions, by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at https://github.com/KunhangL/finemotiondiffuse.
♻ ☆ Tandem Transformers for Inference Efficient LLMs
The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face limitations: either relying on less accurate smaller models for generation or failing to fully leverage the base LLM's representations. We introduce a novel architecture, Tandem transformers, to address these issues. This architecture uniquely combines (1) a small autoregressive model and (2) a large model operating in block mode (processing multiple tokens simultaneously). The small model's predictive accuracy is substantially enhanced by granting it attention to the large model's richer representations. On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. We further incorporate the tandem model within the speculative decoding (SPEED) framework where the large model validates tokens from the small model. This ensures that the Tandem of PaLM2-Bison and PaLM2-Gecko achieves substantial speedup (around 1.14x faster than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream task accuracy.
♻ ☆ A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals in an LLM. Using multimodal information yields relative equal-error-rate improvements over text-only and audio-only models of up to 39% and 61%. Increasing the size of the LLM and training with low-rank adaption leads to further relative EER reductions of up to 18% on our dataset.
comment: arXiv admin note: text overlap with arXiv:2312.03632
♻ ☆ Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning
Through pretraining on a corpus with various sources, Large Language Models (LLMs) have gained impressive performance. However, the impact of each component of the pretraining corpus remains opaque. As a result, the organization of the pretraining corpus is still empirical and may deviate from the optimal. To address this issue, we systematically analyze the impact of 48 datasets from 5 major categories of pretraining data of LLMs and measure their impacts on LLMs using benchmarks about nine major categories of model capabilities. Our analyses provide empirical results about the contribution of multiple corpora on the performances of LLMs, along with their joint impact patterns, including complementary, orthogonal, and correlational relationships. We also identify a set of ``high-impact data'' such as Books that is significantly related to a set of model capabilities. These findings provide insights into the organization of data to support more efficient pretraining of LLMs.
♻ ☆ High-throughput Biomedical Relation Extraction for Semi-Structured Web Articles Empowered by Large Language Models
Objective: To develop a high-throughput biomedical relation extraction system that takes advantage of the large language models'(LLMs) reading comprehension ability and biomedical world knowledge in a scalable and evidential manner. Methods: We formulate the relation extraction task as binary classifications for large language models. Specifically, LLMs make the decision based on the external corpus and its world knowledge, giving the reason for the judgment for factual verification. This method is tailored for semi-structured web articles, wherein we designate the main title as the tail entity and explicitly incorporate it into the context, and the potential head entities are matched based on a biomedical thesaurus. Moreover, lengthy contents are sliced into text chunks, embedded, and retrieved with additional embedding models. Results: Using an open-source LLM, we extracted 248659 relation triplets of three distinct relation types from three reputable biomedical websites. To assess the efficacy of the basic pipeline employed for biomedical relation extraction, we curated a benchmark dataset annotated by a medical expert. Evaluation results indicate that the pipeline exhibits performance comparable to that of GPT-4. Case studies further illuminate challenges faced by contemporary LLMs in the context of biomedical relation extraction for semi-structured web articles. Conclusion: The proposed method has demonstrated its effectiveness in leveraging the strengths of LLMs for high-throughput biomedical relation extraction. Its adaptability is evident, as it can be seamlessly extended to diverse semi-structured biomedical websites, facilitating the extraction of various types of biomedical relations with ease.
♻ ☆ STEntConv: Predicting Disagreement with Stance Detection and a Signed Graph Convolutional Network LREC
The rise of social media platforms has led to an increase in polarised online discussions, especially on political and socio-cultural topics such as elections and climate change. We propose a simple and novel unsupervised method to predict whether the authors of two posts agree or disagree, leveraging user stances about named entities obtained from their posts. We present STEntConv, a model which builds a graph of users and named entities weighted by stance and trains a Signed Graph Convolutional Network (SGCN) to detect disagreement between comment and reply posts. We run experiments and ablation studies and show that including this information improves disagreement detection performance on a dataset of Reddit posts for a range of controversial subreddit topics, without the need for platform-specific features or user history.
comment: Accepted for the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
♻ ☆ PWESuite: Phonetic Word Embeddings and Tasks They Facilitate LREC
Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme and cognate detection and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.
comment: LREC-COLING 2024
♻ ☆ Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models
Underlying data distributions of natural language, programming code, and mathematical symbols vary vastly, presenting a complex challenge for large language models (LLMs) that strive to achieve high performance across all three domains simultaneously. Achieving a very high level of proficiency for an LLM within a specific domain often requires extensive training with relevant corpora, which is typically accompanied by a sacrifice in performance in other domains. In this paper, we propose to fuse models that are already highly-specialized directly. The proposed fusing framework, UltraFuser, consists of three distinct specialists that are already sufficiently trained on language, coding, and mathematics. A token-level gating mechanism is introduced to blend the specialists' outputs. A two-stage training strategy accompanied by balanced sampling is designed to ensure stability. To effectively train the fused model, we further construct a high-quality supervised instruction tuning dataset, UltraChat 2, which includes text, code, and mathematical content. This dataset comprises approximately 300,000 instructions and covers a wide range of topics in each domain. Experiments show that our model could simultaneously achieve mastery of the three crucial domains.
♻ ☆ Efficient Document Embeddings via Self-Contrastive Bregman Divergence Learning ACL 2023
Learning quality document embeddings is a fundamental problem in natural language processing (NLP), information retrieval (IR), recommendation systems, and search engines. Despite recent advances in the development of transformer-based models that produce sentence embeddings with self-contrastive learning, the encoding of long documents (Ks of words) is still challenging with respect to both efficiency and quality considerations. Therefore, we train Longfomer-based document encoders using a state-of-the-art unsupervised contrastive learning method (SimCSE). Further on, we complement the baseline method -- siamese neural network -- with additional convex neural networks based on functional Bregman divergence aiming to enhance the quality of the output document representations. We show that overall the combination of a self-contrastive siamese network and our proposed neural Bregman network outperforms the baselines in two linear classification settings on three long document topic classification tasks from the legal and biomedical domains.
comment: 5 pages, short paper at Findings of ACL 2023
♻ ☆ Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts
In this study, we propose to evaluate the use of deep learning methods for semantic classification at the sentence level to accelerate the process of corpus building in the field of humanities and linguistics, a traditional and time-consuming task. We introduce a novel corpus comprising around 2500 sentences spanning from 300 BCE to 900 CE including sexual semantics (medical, erotica, etc.). We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (centuries, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach, achieving high precision and true positive rates (TPR) of respectively 70.60% and 86.33% using HAN. We evaluate the impact of the dataset size on the model performances (420 instead of 2013), and show that, while our models perform worse, they still offer a high enough precision and TPR, even without MLM, respectively 69% and 51%. Given the result, we provide an analysis of the attention mechanism as a supporting added value for humanists in order to produce more data.
♻ ☆ High-order Joint Constituency and Dependency Parsing LREC
This work revisits the topic of jointly parsing constituency and dependency trees, i.e., to produce compatible constituency and dependency trees simultaneously for input sentences, which is attractive considering that the two types of trees are complementary in representing syntax. The original work of Zhou and Zhao (2019) performs joint parsing only at the inference phase. They train two separate parsers under the multi-task learning framework (i.e., one shared encoder and two independent decoders). They design an ad-hoc dynamic programming-based decoding algorithm of $O(n^5)$ time complexity for finding optimal compatible tree pairs. Compared to their work, we make progress in three aspects: (1) adopting a much more efficient decoding algorithm of $O(n^4)$ time complexity, (2) exploring joint modeling at the training phase, instead of only at the inference phase, (3) proposing high-order scoring components to promote constituent-dependency interaction. We conduct experiments and analysis on seven languages, covering both rich-resource and low-resource scenarios. Results and analysis show that joint modeling leads to a modest overall performance boost over separate modeling, but substantially improves the complete matching ratio of whole trees, thanks to the explicit modeling of tree compatibility.
comment: LREC-COLING 2024
♻ ☆ RU22Fact: Optimizing Evidence for Multilingual Explainable Fact-Checking on Russia-Ukraine Conflict
Fact-checking is the task of verifying the factuality of a given claim by examining the available evidence. High-quality evidence plays a vital role in enhancing fact-checking systems and facilitating the generation of explanations that are understandable to humans. However, the provision of both sufficient and relevant evidence for explainable fact-checking systems poses a challenge. To tackle this challenge, we propose a method based on a Large Language Model to automatically retrieve and summarize evidence from the Web. Furthermore, we construct RU22Fact, a novel multilingual explainable fact-checking dataset on the Russia-Ukraine conflict in 2022 of 16K samples, each containing real-world claims, optimized evidence, and referenced explanation. To establish a baseline for our dataset, we also develop an end-to-end explainable fact-checking system to verify claims and generate explanations. Experimental results demonstrate the prospect of optimized evidence in increasing fact-checking performance and also indicate the possibility of further progress in the end-to-end claim verification and explanation generation tasks.
comment: 12 pages, 3 figures, accepted by lrec-coling2024
♻ ☆ Born With a Silver Spoon? Investigating Socioeconomic Bias in Large Language Models
Socioeconomic bias in society exacerbates disparities, influencing access to opportunities and resources based on individuals' economic and social backgrounds. This pervasive issue perpetuates systemic inequalities, hindering the pursuit of inclusive progress as a society. In this paper, we investigate the presence of socioeconomic bias, if any, in large language models. To this end, we introduce a novel dataset SilverSpoon, consisting of 3000 samples that illustrate hypothetical scenarios that involve underprivileged people performing ethically ambiguous actions due to their circumstances, and ask whether the action is ethically justified. Further, this dataset has a dual-labeling scheme and has been annotated by people belonging to both ends of the socioeconomic spectrum. Using SilverSpoon, we evaluate the degree of socioeconomic bias expressed in large language models and the variation of this degree as a function of model size. We also perform qualitative analysis to analyze the nature of this bias. Our analysis reveals that while humans disagree on which situations require empathy toward the underprivileged, most large language models are unable to empathize with the socioeconomically underprivileged regardless of the situation. To foster further research in this domain, we make SilverSpoon and our evaluation harness publicly available.
♻ ☆ Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding
We evaluated 20+ Transformer models for ranking of long documents (including recent LongP models trained with FlashAttention) and compared them with simple FirstP baselines (applying the same model to input truncated to the first 512 tokens). We used MS MARCO Documents v1 as a primary training set and evaluated models in the zero-shot scenario as well as after fine-tuning on other collections. In our initial experiments with standard collections we found that long-document models underperformed FirstP or outperformed it by at most 5% on average in terms of MRR or NDCG. We then conjectured that this was not due to models inability to process long context but rather due to a positional bias of relevant passages, which tended to be among the first 512 document tokens. We found evidence that this bias was, indeed, present in at least two test sets, which motivated us to create a new collection MS MARCO FarRelevant where the relevant passages were not present among the first 512 tokens. Unlike standard collections where we observed both little benefit from incorporating longer contexts and limited variability in model performance (within a few %), experiments on MS MARCO FarRelevant uncovered dramatic differences among models. FirstP models performed roughly at the random-baseline level in both zero-shot and fine-tuning scenarios. Simple aggregation models (e.g., MaxP) had good zero-shot accuracy but benefited little from fine-tuning. Most other models had poor zero-shot performance (sometimes at a random baseline level) but outstripped MaxP by as much 13-28\% after finetuning. Thus, positional bias not only diminishes benefits of processing longer document contexts but also leads to model overfitting to this bias and performing poorly in a zero-shot setting when a distribution of relevant passages changes substantially. We make our software and MS MARCO FarRelevant available.
♻ ☆ Topic Detection and Tracking with Time-Aware Document Embeddings LREC
The time at which a message is communicated is a vital piece of metadata in many real-world natural language processing tasks such as Topic Detection and Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event, and in that context, stories that describe the same event are likely to have been written at around the same time. Prior work on time modeling for TDT takes this into account, but does not well capture how time interacts with the semantic nature of the event. For example, stories about a tropical storm are likely to be written within a short time interval, while stories about a movie release may appear over weeks or months. In our work, we design a neural method that fuses temporal and textual information into a single representation of news documents for event detection. We fine-tune these time-aware document embeddings with a triplet loss architecture, integrate the model into downstream TDT systems, and evaluate the systems on two benchmark TDT data sets in English. In the retrospective setting, we apply clustering algorithms to the time-aware embeddings and show substantial improvements over baselines on the News2013 data set. In the online streaming setting, we add our document encoder to an existing state-of-the-art TDT pipeline and demonstrate that it can benefit the overall performance. We conduct ablation studies on the time representation and fusion algorithm strategies, showing that our proposed model outperforms alternative strategies. Finally, we probe the model to examine how it handles recurring events more effectively than previous TDT systems.
comment: Accepted to LREC-COLING 2024
♻ ☆ GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher ICLR 2024
Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.
comment: Accepted by ICLR 2024. 21 pages, 3 figures, 13 tables
♻ ☆ Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction COLING 2024
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction, i.e., prompts tend to introduce biases toward specific labels. Prompt bias presents a significant challenge in assessing the factual knowledge within PLMs. Therefore, this paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias. We show that: 1) all prompts in the experiments exhibit non-negligible bias, with gradient-based prompts like AutoPrompt and OptiPrompt displaying significantly higher levels of bias; 2) prompt bias can amplify benchmark accuracy unreasonably by overfitting the test datasets, especially on imbalanced datasets like LAMA. Based on these findings, we propose a representation-based approach to mitigate the prompt bias during inference time. Specifically, we first estimate the biased representation using prompt-only querying, and then remove it from the model's internal representations to generate the debiased representations, which are used to produce the final debiased outputs. Experiments across various prompts, PLMs, and benchmarks show that our approach can not only correct the overfitted performance caused by prompt bias, but also significantly improve the prompt retrieval capability (up to 10% absolute performance gain). These results indicate that our approach effectively alleviates prompt bias in knowledge evaluation, thereby enhancing the reliability of benchmark assessments. Hopefully, our plug-and-play approach can be a golden standard to strengthen PLMs toward reliable knowledge bases. Code and data are released in https://github.com/FelliYang/PromptBias.
comment: Accepted by COLING 2024
♻ ☆ Can Large Language Models Discern Evidence for Scientific Hypotheses? Case Studies in the Social Sciences
Hypothesis formulation and testing are central to empirical research. A strong hypothesis is a best guess based on existing evidence and informed by a comprehensive view of relevant literature. However, with exponential increase in the number of scientific articles published annually, manual aggregation and synthesis of evidence related to a given hypothesis is a challenge. Our work explores the ability of current large language models (LLMs) to discern evidence in support or refute of specific hypotheses based on the text of scientific abstracts. We share a novel dataset for the task of scientific hypothesis evidencing using community-driven annotations of studies in the social sciences. We compare the performance of LLMs to several state-of-the-art benchmarks and highlight opportunities for future research in this area. The dataset is available at https://github.com/Sai90000/ScientificHypothesisEvidencing.git
♻ ☆ Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning CVPR 2024
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. Relying on learning the joint representation of seen compositions, these methods ignore the explicit modeling of the state and object, thus limiting the exploitation of pre-trained knowledge and generalization to unseen compositions. With a particular focus on the universality of the solution, in this work, we propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition. The presented Troika is our implementation that aligns the branch-specific prompt representations with decomposed visual features. To calibrate the bias between semantically similar multi-modal representations, we further devise a Cross-Modal Traction module into Troika that shifts the prompt representation towards the current visual content. We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings. The code will be available at https://github.com/bighuang624/Troika.
comment: CVPR 2024
♻ ☆ Towards a RAG-based Summarization Agent for the Electron-Ion Collider
The complexity and sheer volume of information encompassing documents, papers, data, and other resources from large-scale experiments demand significant time and effort to navigate, making the task of accessing and utilizing these varied forms of information daunting, particularly for new collaborators and early-career scientists. To tackle this issue, a Retrieval Augmented Generation (RAG)--based Summarization AI for EIC (RAGS4EIC) is under development. This AI-Agent not only condenses information but also effectively references relevant responses, offering substantial advantages for collaborators. Our project involves a two-step approach: first, querying a comprehensive vector database containing all pertinent experiment information; second, utilizing a Large Language Model (LLM) to generate concise summaries enriched with citations based on user queries and retrieved data. We describe the evaluation methods that use RAG assessments (RAGAs) scoring mechanisms to assess the effectiveness of responses. Furthermore, we describe the concept of prompt template-based instruction-tuning which provides flexibility and accuracy in summarization. Importantly, the implementation relies on LangChain, which serves as the foundation of our entire workflow. This integration ensures efficiency and scalability, facilitating smooth deployment and accessibility for various user groups within the Electron Ion Collider (EIC) community. This innovative AI-driven framework not only simplifies the understanding of vast datasets but also encourages collaborative participation, thereby empowering researchers. As a demonstration, a web application has been developed to explain each stage of the RAG Agent development in detail.
comment: updated title to have no latex formatting
♻ ☆ AIOS: LLM Agent Operating System
The integration and deployment of large language model (LLM)-based intelligent agents have been fraught with challenges that compromise their efficiency and efficacy. Among these issues are sub-optimal scheduling and resource allocation of agent requests over the LLM, the difficulties in maintaining context during interactions between agent and LLM, and the complexities inherent in integrating heterogeneous agents with different capabilities and specializations. The rapid increase of agent quantity and complexity further exacerbates these issues, often leading to bottlenecks and sub-optimal utilization of resources. Inspired by these challenges, this paper presents AIOS, an LLM agent operating system, which embeds large language model into operating systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI. Specifically, AIOS is designed to optimize resource allocation, facilitate context switch across agents, enable concurrent execution of agents, provide tool service for agents, and maintain access control for agents. We present the architecture of such an operating system, outline the core challenges it aims to resolve, and provide the basic design and implementation of the AIOS. Our experiments on concurrent execution of multiple agents demonstrate the reliability and efficiency of our AIOS modules. Through this, we aim to not only improve the performance and efficiency of LLM agents but also to pioneer for better development and deployment of the AIOS ecosystem in the future. The project is open-source at https://github.com/agiresearch/AIOS.
comment: 14 pages, 5 figures, 5 tables; comments and suggestions are appreciated
♻ ☆ Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. In this work, we first conduct a systematic study of the misalignment between LLM evaluators and human judgement, revealing that existing calibration methods aimed at mitigating biases are insufficient for effectively aligning LLM evaluators. Inspired by the use of preference data in RLHF, we formulate the evaluation as a ranking problem and introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts. PairS achieves state-of-the-art performance on representative evaluation tasks and demonstrates significant improvements over direct scoring. Furthermore, we provide insights into the role of pairwise preference in quantifying the transitivity of LLMs and demonstrate how PairS benefits from calibration.
♻ ☆ Large Language Models in Biomedical and Health Informatics: A Bibliometric Review
Large Language Models (LLMs) have rapidly become important tools in Biomedical and Health Informatics (BHI), enabling new ways to analyze data, treat patients, and conduct research. This bibliometric review aims to provide a panoramic view of how LLMs have been used in BHI by examining research articles and collaboration networks from 2022 to 2023. It further explores how LLMs can improve Natural Language Processing (NLP) applications in various BHI areas like medical diagnosis, patient engagement, electronic health record management, and personalized medicine. To do this, our bibliometric review identifies key trends, maps out research networks, and highlights major developments in this fast-moving field. Lastly, it discusses the ethical concerns and practical challenges of using LLMs in BHI, such as data privacy and reliable medical recommendations. Looking ahead, we consider how LLMs could further transform biomedical research as well as healthcare delivery and patient outcomes. This bibliometric review serves as a resource for stakeholders in healthcare, including researchers, clinicians, and policymakers, to understand the current state and future potential of LLMs in BHI.
comment: 50 pages, 7 figures, 4 tables
♻ ☆ First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models
Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation (MT). We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. We argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches.
♻ ☆ Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic COLING 2024
Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their reasoning often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. These models sometimes show hallucinations as their reasoning procedures are unconstrained by logical principles. Aiming at improving the zero-shot chain-of-thought reasoning ability of large language models, we propose LoT (Logical Thoughts), a self-improvement prompting framework that leverages principles rooted in symbolic logic, particularly Reductio ad Absurdum, to systematically verify and rectify the reasoning processes step by step. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic. The implementation code for LoT can be accessed at: https://github.com/xf-zhao/LoT.
comment: Accepted in COLING 2024. Code see https://github.com/xf-zhao/LoT
♻ ☆ Do large language models resemble humans in language use?
Large language models (LLMs) such as ChatGPT and Vicuna have shown remarkable capacities in comprehending and producing language. However, their internal workings remain a black box, and it is unclear whether LLMs and chatbots can develop humanlike characteristics in language use. Cognitive scientists have devised many experiments that probe, and have made great progress in explaining, how people comprehend and produce language. We subjected ChatGPT and Vicuna to 12 of these experiments ranging from sounds to dialogue, preregistered and with 1000 runs (i.e., iterations) per experiment. ChatGPT and Vicuna replicated the human pattern of language use in 10 and 7 out of the 12 experiments, respectively. The models associated unfamiliar words with different meanings depending on their forms, continued to access recently encountered meanings of ambiguous words, reused recent sentence structures, attributed causality as a function of verb semantics, and accessed different meanings and retrieved different words depending on an interlocutor's identity. In addition, ChatGPT, but not Vicuna, nonliterally interpreted implausible sentences that were likely to have been corrupted by noise, drew reasonable inferences, and overlooked semantic fallacies in a sentence. Finally, unlike humans, neither model preferred using shorter words to convey less informative content, nor did they use context to resolve syntactic ambiguities. We discuss how these convergences and divergences may result from the transformer architecture. Overall, these experiments demonstrate that LLMs such as ChatGPT (and Vicuna to a lesser extent) are humanlike in many aspects of human language processing.
♻ ☆ SelfIE: Self-Interpretation of Large Language Model Embeddings
How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model developments. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in the hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions on hidden embeddings also open up new avenues to control LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation of individual layer. We extend RLHF to hidden embeddings and propose Reinforcement Control that erases harmful knowledge in LLM without supervision targets.
Software Engineering
☆ MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
In software evolution, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing functionalities. Large Language Models (LLMs) have shown promise in code generation and understanding but face difficulties in code change, particularly at the repository level. To overcome these challenges, we empirically study the reason why LLMs mostly fail to resolve GitHub issues and analyze some impact factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four kinds of agents customized for the software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% GitHub issues, which significantly outperforms the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the based LLM of our method. We also analyze the factors for improving GitHub issue resolution rates, such as line location, task allocation, etc.
comment: work in progress
☆ Proceedings Sixth Workshop on Models for Formal Analysis of Real Systems
This volume contains the proceedings of MARS 2024, the sixth workshop on Models for Formal Analysis of Real Systems, held as part of ETAPS 2024, the European Joint Conferences on Theory and Practice of Software. The MARS workshops bring together researchers from different communities who are developing formal models of real systems in areas where complex models occur, such as networks, cyber-physical systems, hardware/software co-design, biology, etc. The motivation and aim for MARS stem from the following two observations: (1) Large case studies are essential to show that specification formalisms and modelling techniques are applicable to real systems, whereas many research papers only consider toy examples or tiny case studies. (2) Developing an accurate model of a real system takes a large amount of time, often months or years. In most scientific papers, however, salient details of the model need to be skipped due to lack of space, and to leave room for formal verification methodologies and results. The MARS workshops aim at remedying these issues, emphasising modelling over verification, so as to retain lessons learnt from formal modelling, which are not usually discussed elsewhere.
☆ SPES: Towards Optimizing Performance-Resource Trade-Off for Serverless Functions ICDE 2024
As an emerging cloud computing deployment paradigm, serverless computing is gaining traction due to its efficiency and ability to harness on-demand cloud resources. However, a significant hurdle remains in the form of the cold start problem, causing latency when launching new function instances from scratch. Existing solutions tend to use over-simplistic strategies for function pre-loading/unloading without full invocation pattern exploitation, rendering unsatisfactory optimization of the trade-off between cold start latency and resource waste. To bridge this gap, we propose SPES, the first differentiated scheduler for runtime cold start mitigation by optimizing serverless function provision. Our insight is that the common architecture of serverless systems prompts the con- centration of certain invocation patterns, leading to predictable invocation behaviors. This allows us to categorize functions and pre-load/unload proper function instances with finer-grained strategies based on accurate invocation prediction. Experiments demonstrate the success of SPES in optimizing serverless function provision on both sides: reducing the 75th-percentile cold start rates by 49.77% and the wasted memory time by 56.43%, compared to the state-of-the-art. By mitigating the cold start issue, SPES is a promising advancement in facilitating cloud services deployed on serverless architectures.
comment: 12 pages, accepted by ICDE 2024 (40th IEEE International Conference on Data Engineering)
☆ Natural Language Requirements Testability Measurement Based on Requirement Smells
Requirements form the basis for defining software systems' obligations and tasks. Testable requirements help prevent failures, reduce maintenance costs, and make it easier to perform acceptance tests. However, despite the importance of measuring and quantifying requirements testability, no automatic approach for measuring requirements testability has been proposed based on the requirements smells, which are at odds with the requirements testability. This paper presents a mathematical model to evaluate and rank the natural language requirements testability based on an extensive set of nine requirements smells, detected automatically, and acceptance test efforts determined by requirement length and its application domain. Most of the smells stem from uncountable adjectives, context-sensitive, and ambiguous words. A comprehensive dictionary is required to detect such words. We offer a neural word-embedding technique to generate such a dictionary automatically. Using the dictionary, we could automatically detect Polysemy smell (domain-specific ambiguity) for the first time in 10 application domains. Our empirical study on nearly 1000 software requirements from six well-known industrial and academic projects demonstrates that the proposed smell detection approach outperforms Smella, a state-of-the-art tool, in detecting requirements smells. The precision and recall of smell detection are improved with an average of 0.03 and 0.33, respectively, compared to the state-of-the-art. The proposed requirement testability model measures the testability of 985 requirements with a mean absolute error of 0.12 and a mean squared error of 0.03, demonstrating the model's potential for practical use.
comment: 45 pages, 16 figures, and 13 tables; submitted as a journal paper
☆ An Empirical Study of ChatGPT-related projects on GitHub
As ChatGPT possesses powerful capabilities in natural language processing and code analysis, it has received widespread attention since its launch. Developers have applied its powerful capabilities to various domains through software projects which are hosted on the largest open-source platform (GitHub) worldwide. Simultaneously, these projects have triggered extensive discussions. In order to comprehend the research content of these projects and understand the potential requirements discussed, we collected ChatGPT-related projects from the GitHub platform and utilized the LDA topic model to identify the discussion topics. Specifically, we selected 200 projects, categorizing them into three primary categories through analyzing their descriptions: ChatGPT implementation & training, ChatGPT application, ChatGPT improvement & extension. Subsequently, we employed the LDA topic model to identify 10 topics from issue texts, and compared the distribution and evolution trend of the discovered topics within the three primary project categories. Our observations include (1) The number of projects growing in a single month for the three primary project categories are closely associated with the development of ChatGPT. (2) There exist significant variations in the popularity of each topic for the three primary project categories. (3) The monthly changes in the absolute impact of each topic for the three primary project categories are diverse, which is often closely associated with the variation in the number of projects owned by that category. (4) With the passage of time, the relative impact of each topic exhibits different development trends in the three primary project categories. Based on these findings, we discuss implications for developers and users.
☆ Characterizing Dependency Update Practice of NPM, PyPI and Cargo Packages
Keeping dependencies up-to-date prevents software supply chain attacks through outdated and vulnerable dependencies. Developers may use packages' dependency update practice as one of the selection criteria for choosing a package as a dependency. However, the lack of metrics characterizing packages' dependency update practice makes this assessment difficult. To measure the up-to-date characteristics of packages, we focus on the dependency management aspect and propose two update metrics: Time-Out-Of-Date (TOOD) and Post-Fix-Exposure-Time (PFET), to measure the updatedness of dependencies and updatedness of vulnerable dependencies, respectively. We design an algorithm to stabilize the dependency relationships in different time intervals and compute the proposed metrics for each package. Using our proposed metrics, we conduct a large-scale empirical study of update metrics with 2.9M packages, 66.8M package versions, and 26.8M unique package-dependency relations in NPM, PyPI, and Cargo, ranging from the year 2004 to 2023. We analyze the characteristics of the proposed metrics for capturing packages' dependency update practice in the three ecosystems. Given that the TOOD metric generates a greater volume of data than the PFET metric, we further explore the numerical relationship between these metrics to assess their potential as substitutes for vulnerability counts metrics. We find that PyPI packages update dependencies faster than NPM and Cargo. Conversely, Cargo packages update their vulnerable dependencies faster than NPM and PyPI. We also find that the general purpose update metric, TOOD, can be a proxy for the security-focused update metric, PFET.
comment: currently under review
☆ MESIA: Understanding and Leveraging Supplementary Nature of Method-level Comments for Automatic Comment Generation
Code comments are important for developers in program comprehension. In scenarios of comprehending and reusing a method, developers expect code comments to provide supplementary information beyond the method signature. However, the extent of such supplementary information varies a lot in different code comments. In this paper, we raise the awareness of the supplementary nature of method-level comments and propose a new metric named MESIA (Mean Supplementary Information Amount) to assess the extent of supplementary information that a code comment can provide. With the MESIA metric, we conduct experiments on a popular code-comment dataset and three common types of neural approaches to generate method-level comments. Our experimental results demonstrate the value of our proposed work with a number of findings. (1) Small-MESIA comments occupy around 20% of the dataset and mostly fall into only the WHAT comment category. (2) Being able to provide various kinds of essential information, large-MESIA comments in the dataset are difficult for existing neural approaches to generate. (3) We can improve the capability of existing neural approaches to generate large-MESIA comments by reducing the proportion of small-MESIA comments in the training set. (4) The retrained model can generate large-MESIA comments that convey essential meaningful supplementary information for methods in the small-MESIA test set, but will get a lower BLEU score in evaluation. These findings indicate that with good training data, auto-generated comments can sometimes even surpass human-written reference comments, and having no appropriate ground truth for evaluation is an issue that needs to be addressed by future work on automatic comment generation.
comment: In 32nd IEEE/ACM International Conference on Program Comprehension (ICPC'24)
♻ ☆ FastFlip: Compositional Error Injection Analysis
Instruction-level error injection analyses aim to find instructions where errors often lead to unacceptable outcomes like Silent Data Corruptions (SDCs). These analyses require significant time, which is especially problematic if developers wish to regularly analyze software that evolves over time. We present FastFlip, a combination of empirical error injection and symbolic SDC propagation analyses that enables fast, compositional error injection analysis of evolving programs. FastFlip calculates how SDCs propagate across program sections and correctly accounts for unexpected side effects that can occur due to errors. Using FastFlip, we analyze five benchmarks, plus two modified versions of each benchmark. FastFlip speeds up the analysis of incrementally modified programs by $3.2\times$ (geomean). FastFlip selects a set of instructions to protect against SDCs that minimizes the runtime cost of protection while protecting against a developer-specified target fraction of all SDC-causing errors.
♻ ☆ Mining Architectural Information: A Systematic Mapping Study
Mining Software Repositories (MSR) has become an essential activity in software development. Mining architectural information to support architecting activities, such as architecture understanding, has received significant attention in recent years. However, there is a lack of clarity on what literature on mining architectural information is available. Consequently, this may create difficulty for practitioners to understand and adopt the state-of-the-art research results, such as what approaches should be adopted to mine what architectural information in order to support architecting activities. It also hinders researchers from being aware of the challenges and remedies for the identified research gaps. We aim to identify, analyze, and synthesize the literature on mining architectural information in terms of architectural information and sources mined, architecting activities supported, approaches and tools used, and challenges faced. An SMS has been conducted on the literature published between January 2006 and December 2022. Of the 104 primary studies selected, 7 categories of architectural information have been mined, among which architectural description is the most mined architectural information; 11 categories of sources have been leveraged for mining architectural information, among which version control system is the most popular source; 11 architecting activities can be supported by the mined architectural information, among which architecture understanding is the most supported activity; 95 approaches and 56 tools were proposed and employed in mining architectural information; and 4 types of challenges in mining architectural information were identified. This SMS provides researchers with future directions and help practitioners be aware of what approaches and tools can be used to mine what architectural information from what sources to support various architecting activities.
comment: Preprint accepted for publication in Empirical Software Engineering, 2024
♻ ☆ Toward a Theory of Causation for Interpreting Neural Code Models
Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post hoc interpretability method specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and ten NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax. All our NCMs, except for the BERT-like model, statistically learn to predict tokens related to blocks of code (\eg brackets, parenthesis, semicolon) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful method to detect and facilitate the elimination of confounding bias in NCMs.
comment: Accepted to appear in IEEE Transactions on Software Engineering
♻ ☆ Tackling Long Code Search with Splitting, Encoding, and Aggregating LREC
Code search with natural language helps us reuse existing code snippets. Thanks to the Transformer-based pretraining models, the performance of code search has been improved significantly. However, due to the quadratic complexity of multi-head self-attention, there is a limit on the input token length. For efficient training on standard GPUs like V100, existing pretrained code models, including GraphCodeBERT, CodeBERT, RoBERTa (code), take the first 256 tokens by default, which makes them unable to represent the complete information of long code that is greater than 256 tokens. To tackle the long code problem, we propose a new baseline SEA (Split, Encode and Aggregate), which splits long code into code blocks, encodes these blocks into embeddings, and aggregates them to obtain a comprehensive long code representation. With SEA, we could directly use Transformer-based pretraining models to model long code without changing their internal structure and re-pretraining. We also compare SEA with sparse Trasnformer methods. With GraphCodeBERT as the encoder, SEA achieves an overall mean reciprocal ranking score of 0.785, which is 10.1% higher than GraphCodeBERT on the CodeSearchNet benchmark, justifying SEA as a strong baseline for long code search. Our source code and experimental data are available at: https://github.com/fly-dragon211/SEA.
comment: Accepted by LREC-COLING 2024
♻ ☆ ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation
System Verilog Assertion (SVA) formulation- a critical yet complex task is a prerequisite in the Formal Property Verification (FPV) process. Traditionally, SVA formulation involves expert-driven interpretation of specifications, which is timeconsuming and prone to human error. However, LLM-informed automatic assertion generation is gaining interest. We designeda novel framework called ChIRAAG, based on OpenAI GPT4, to generate SVA assertions from natural language specifications. ChIRAAG constitutes the systematic breakdown of design specifications into a standardized format, further generating assertions from formatted specifications using LLM. Furthermore, we developed testbenches to verify/validate the LLM-generated assertions. Automatic feedback of log files from the simulation tool to the LLM ensures that the framework can generate correc SVAs automatically. Only 33% of LLM-generated raw assertions had errors. Our results on OpenTitan designs shows that LLMs can streamline and assist engineers in the assertion generation process, reshaping verification workflows.
comment: 6 pages, 5 figures and 2 table
♻ ☆ FedCSD: A Federated Learning Based Approach for Code-Smell Detection
This paper proposes a Federated Learning Code Smell Detection (FedCSD) approach that allows organizations to collaboratively train federated ML models while preserving their data privacy. These assertions have been supported by three experiments that have significantly leveraged three manually validated datasets aimed at detecting and examining different code smell scenarios. In experiment 1, which was concerned with a centralized training experiment, dataset two achieved the lowest accuracy (92.30%) with fewer smells, while datasets one and three achieved the highest accuracy with a slight difference (98.90% and 99.5%, respectively). This was followed by experiment 2, which was concerned with cross-evaluation, where each ML model was trained using one dataset, which was then evaluated over the other two datasets. Results from this experiment show a significant drop in the model's accuracy (lowest accuracy: 63.80\%) where fewer smells exist in the training dataset, which has a noticeable reflection (technical debt) on the model's performance. Finally, the last and third experiments evaluate our approach by splitting the dataset into 10 companies. The ML model was trained on the company's site, then all model-updated weights were transferred to the server. Ultimately, an accuracy of 98.34% was achieved by the global model that has been trained using 10 companies for 100 training rounds. The results reveal a slight difference in the global model's accuracy compared to the highest accuracy of the centralized model, which can be ignored in favour of the global model's comprehensive knowledge, lower training cost, preservation of data privacy, and avoidance of the technical debt problem.
comment: 17 pages, 7 figures, Journal paper
♻ ☆ DISL: Fueling Research with A Large Dataset of Solidity Smart Contracts
The DISL dataset features a collection of $514,506$ unique Solidity files that have been deployed to Ethereum mainnet. It caters to the need for a large and diverse dataset of real-world smart contracts. DISL serves as a resource for developing machine learning systems and for benchmarking software engineering tools designed for smart contracts. By aggregating every verified smart contract from Etherscan up to January 15, 2024, DISL surpasses existing datasets in size and recency.
♻ ☆ Scalable and Precise Application-Centered Call Graph Construction for Python
Call graph construction is the foundation of inter-procedural static analysis. PYCG is the state-of-the-art approach for constructing call graphs for Python programs. Unfortunately, PyCG does not scale to large programs when adapted to whole-program analysis where application and dependent libraries are both analyzed. Moreover, PyCG is flow-insensitive and does not fully support Python's features, hindering its accuracy. To overcome these drawbacks, we propose a scalable and precise approach for constructing application-centered call graphs for Python programs, and implement it as a prototype tool JARVIS. JARVIS maintains a type graph (i.e., type relations of program identifiers) for each function in a program to allow type inference. Taking one function as an input, JARVIS generates the call graph on-the-fly, where flow-sensitive intra-procedural analysis and inter-procedural analysis are conducted in turn and strong updates are conducted. Our evaluation on a micro-benchmark of 135 small Python programs and a macro-benchmark of 6 real-world Python applications has demonstrated that JARVIS can significantly improve PYCG by at least 67% faster in time, 84% higher in precision, and at least 20% higher in recall.
comment: 13 pages
Programming Languages
☆ Proceedings Sixth Workshop on Models for Formal Analysis of Real Systems
This volume contains the proceedings of MARS 2024, the sixth workshop on Models for Formal Analysis of Real Systems, held as part of ETAPS 2024, the European Joint Conferences on Theory and Practice of Software. The MARS workshops bring together researchers from different communities who are developing formal models of real systems in areas where complex models occur, such as networks, cyber-physical systems, hardware/software co-design, biology, etc. The motivation and aim for MARS stem from the following two observations: (1) Large case studies are essential to show that specification formalisms and modelling techniques are applicable to real systems, whereas many research papers only consider toy examples or tiny case studies. (2) Developing an accurate model of a real system takes a large amount of time, often months or years. In most scientific papers, however, salient details of the model need to be skipped due to lack of space, and to leave room for formal verification methodologies and results. The MARS workshops aim at remedying these issues, emphasising modelling over verification, so as to retain lessons learnt from formal modelling, which are not usually discussed elsewhere.
☆ CSSTs: A Dynamic Data Structure for Partial Orders in Concurrent Execution Analysis
Dynamic analyses are a standard approach to analyzing and testing concurrent programs. Such techniques observe program traces and analyze them to infer the presence or absence of bugs. At its core, each analysis maintains a partial order $P$ that represents order dependencies between events of the analyzed trace $\sigma$. Naturally, the scalability of the analysis largely depends on how efficiently it maintains $P$. The standard data structure for this task has thus far been vector clocks. These, however, are slow for analyses that follow a non-streaming style, costing $O(n)$ for inserting (and propagating) each new ordering in $P$, where $n$ is the size of $\sigma$, while they cannot handle the deletion of existing orderings. In this paper we develop collective sparse segment trees (CSSTs), a simple but elegant data structure for generically maintaining a partial order $P$. CSSTs thrive when the width $k$ of $P$ is much smaller than the size $n$ of its domain, allowing inserting, deleting, and querying for orderings in $P$ to run in $O(logn)$ time. For a concurrent trace, $k$ is bounded by the number of its threads, and is normally orders of magnitude smaller than its size $n$, making CSSTs fitting for this setting. Our experimental results confirm that CSSTs are the best data structure currently to handle a range of dynamic analyses from existing literature.
☆ Piecewise Linear Expectation Analysis via $k$-Induction for Probabilistic Programs
Quantitative analysis of probabilistic programs aims at deriving tight numerical bounds for probabilistic properties such as expectation and assertion probability, and plays a crucial role in the verification of probabilistic programs. Along this line of research, most existing works consider numerical bounds over the whole state space monolithically and do not consider piecewise bounds. Clearly, monolithic bounds are either conservative, or not expressive and succinct enough in general. To derive more succinct, expressive and precise numerical bounds for probabilistic properties, we propose a novel approach for synthesizing piecewise linear bounds in this work. To this end, we first show how to extract a piecewise feature w.r.t. a given quantitative property from a probabilistic program using latticed $k$-induction that captures a wide and representative class of piecewise bound functions. Second, we develop an algorithmic approach to synthesize piecewise linear upper and lower bounds from the piecewise feature, for which we show that the synthesis of piecewise linear bounds can be reduced to bilinear programming. Third, we implement our approach with the bilinear programming solver Gurobi. The experimental results indicate that our approach is capable of generating tight or even accurate piecewise linear bounds for an extensive set of benchmarks compared with the state of the art.
☆ Java Classes with "-Er" and "-Utils" Suffixes Have Higher Complexity
In object-oriented programming languages, a belief exists that classes with -Er/-Or and -Utils suffixes are "code smells" because they take over a lot of functional responsibility, turning out to be bulky and complicated, and therefore making it more difficult to maintain the code. In order to validate this intuition, we analyzed complexity and cohesion of 13,861 Java classes from 212 unique open-source GitHub repositories. We found out that average values of Cyclomatic Complexity and Cognitive Complexity metrics are at least 2.5 times higher when suffixes are present.
♻ ☆ Quantum Control Machine: The Limits of Control Flow in Quantum Programming
Quantum algorithms for tasks such as factorization, search, and simulation rely on control flow such as branching and iteration that depends on the value of data in superposition. High-level programming abstractions for control flow, such as switches, loops, and higher-order functions, are ubiquitous in classical languages. By contrast, many quantum languages do not provide high-level abstractions for control flow in superposition, and instead require the use of hardware-level logic gates to implement such control flow. The reason for this gap is that whereas a classical computer supports control flow using a program counter that can depend on data, the typical architecture of a quantum computer does not provide a program counter that can depend on data in superposition. As a result, the complete set of control flow abstractions that can be correctly realized on a quantum computer has not yet been established. In this work, we provide a complete characterization of the properties of control flow abstractions that are correctly realizable on a quantum computer. First, we prove that even on a quantum computer whose program counter exists in superposition, one cannot correctly realize control flow in quantum algorithms by lifting the classical conditional jump instruction to work in superposition. This theorem denies the ability to directly lift general abstractions for control flow such as the $\lambda$-calculus from classical to quantum programming. In response, we present the necessary and sufficient conditions for control flow to be correctly realizable on a quantum computer. We introduce the quantum control machine, an instruction set architecture featuring a conditional jump that is restricted to satisfy these conditions. We show how this design enables a developer to correctly express control flow in quantum algorithms using a program counter in place of logic gates.
comment: 24 pages, 10 figures. v5: added funding acknowledgements. v4: camera-ready version. v3: added examples and improved organization of paper. v2: switched LaTeX template, improved descriptive text and added more discussion of implications
Formal Languages and Automata Theory
♻ ☆ Probabilistic automatic complexity of finite strings
We introduce a new complexity measure for finite strings using probabilistic finite-state automata (PFAs), in the same spirit as existing notions employing DFAs and NFAs, and explore its properties. The PFA complexity $A_P(x)$ is the least number of states of a PFA for which $x$ is the most likely string of its length to be accepted. The variant $A_{P,\delta}(x)$ adds a real-valued parameter $\delta$ specifying a lower bound on the gap in acceptance probabilities between $x$ and other strings. We relate $A_P$ to the DFA and NFA complexities, obtain a complete classification of binary strings with $A_P=2$, and prove $A_{P,\delta}(x)$ is computable for every $x$ and for cofinitely many $\delta$ (depending on $x$). Finally, we discuss several other variations on $A_P$ with a view to obtaining additional desirable properties.
comment: 41 pages, 5 figures. This work extends Chapter 2 of the author's PhD dissertation at Penn State. V2: fix statement of Proposition 3.4
Symbolic Computation
♻ ☆ Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic COLING 2024
Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their reasoning often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. These models sometimes show hallucinations as their reasoning procedures are unconstrained by logical principles. Aiming at improving the zero-shot chain-of-thought reasoning ability of large language models, we propose LoT (Logical Thoughts), a self-improvement prompting framework that leverages principles rooted in symbolic logic, particularly Reductio ad Absurdum, to systematically verify and rectify the reasoning processes step by step. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic. The implementation code for LoT can be accessed at: https://github.com/xf-zhao/LoT.
comment: Accepted in COLING 2024. Code see https://github.com/xf-zhao/LoT
Computer Vision and Pattern Recognition
☆ Efficient Video Object Segmentation via Modulated Cross-Attention Memory
Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x, while significantly reducing the GPU memory by 87% with comparable segmentation performance on short and long video datasets. Notably on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.
☆ ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis CVPR 2024
Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to generate beat-aligned motions, struggle generating gestures that are semantically aligned with the utterance. Compared to beat gestures that align naturally to the audio signal, semantically coherent gestures require modeling the complex interactions between the language and human motion, and can be controlled by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures based on multi-modal speech inputs, but can also facilitate controllability in gesture synthesis. Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities (e.g. audio vs text) as well as to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures. To further advance the research on multi-party interactive gestures, the DnD Group Gesture dataset is released, which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate effectiveness of our method on a variety of tasks. We urge the reader to watch our supplementary video at our website.
comment: CVPR 2024. Project Page: https://vcai.mpi-inf.mpg.de/projects/ConvoFusion/
☆ OmniVid: A Generative Framework for Universal Video Understanding CVPR 2024
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, which simplifies the training of powerful foundational language models, such as GPT-3, with extensive training corpora. Inspired by this, we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. In this way, a variety of video tasks could be formulated as video-grounded token generation. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture, following a generative framework. Through comprehensive experiments, we demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks, providing a novel perspective for more universal video understanding. Code is available at https://github.com/wangjk666/OmniVid.
comment: Accepted by CVPR 2024
☆ AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation
Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves the human body, hand, and expression estimation. Most existing methods have tackled this task in a two-stage manner, first detecting the human body part with an off-the-shelf detection model and inferring the different human body parts individually. Despite the impressive results achieved, these methods suffer from 1) loss of valuable contextual information via cropping, 2) introducing distractions, and 3) lacking inter-association among different persons and body parts, inevitably causing performance degradation, especially for crowded scenes. To address these issues, we introduce a novel all-in-one-stage framework, AiOS, for multiple expressive human pose and shape recovery without an additional human detection step. Specifically, our method is built upon DETR, which treats multi-person whole-body mesh recovery task as a progressive set prediction problem with various sequential detection. We devise the decoder tokens and extend them to our task. Specifically, we first employ a human token to probe a human location in the image and encode global features for each instance, which provides a coarse location for the later transformer block. Then, we introduce a joint-related token to probe the human joint in the image and encoder a fine-grained local feature, which collaborates with the global feature to regress the whole-body mesh. This straightforward but effective model outperforms previous state-of-the-art methods by a 9% reduction in NMVE on AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a 3% reduction in PVE on EgoBody.
comment: Homepage: https://ttxskk.github.io/AiOS/
☆ SLEDGE: Synthesizing Simulation Environments for Driving Agents with Generative Models
SLEDGE is the first generative simulator for vehicle motion planning trained on real-world driving logs. Its core component is a learned model that is able to generate agent bounding boxes and lane graphs. The model's outputs serve as an initial state for traffic simulation. The unique properties of the entities to be generated for SLEDGE, such as their connectivity and variable count per scene, render the naive application of most modern generative models to this task non-trivial. Therefore, together with a systematic study of existing lane graph representations, we introduce a novel raster-to-vector autoencoder (RVAE). It encodes agents and the lane graph into distinct channels in a rasterized latent map. This facilitates both lane-conditioned agent generation and combined generation of lanes and agents with a Diffusion Transformer. Using generated entities in SLEDGE enables greater control over the simulation, e.g. upsampling turns or increasing traffic density. Further, SLEDGE can support 500m long routes, a capability not found in existing data-driven simulators like nuPlan. It presents new challenges for planning algorithms, evidenced by failure rates of over 40% for PDM, the winner of the 2023 nuPlan challenge, when tested on hard routes and dense traffic generated by our model. Compared to nuPlan, SLEDGE requires 500$\times$ less storage to set up (<4GB), making it a more accessible option and helping with democratizing future research in this field.
☆ Track Everything Everywhere Fast and Robustly
We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications. OmniMotion is sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by the vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than \textbf{10 times} faster), robustness, and accuracy in tracking over the SoTA optimization-based method OmniMotion.
comment: project page: https://timsong412.github.io/FastOmniTrack/
☆ Towards Explaining Hypercomplex Neural Networks
Hypercomplex neural networks are gaining increasing interest in the deep learning community. The attention directed towards hypercomplex models originates from several aspects, spanning from purely theoretical and mathematical characteristics to the practical advantage of lightweight models over conventional networks, and their unique properties to capture both global and local relations. In particular, a branch of these architectures, parameterized hypercomplex neural networks (PHNNs), has also gained popularity due to their versatility across a multitude of application domains. Nonetheless, only few attempts have been made to explain or interpret their intricacies. In this paper, we propose inherently interpretable PHNNs and quaternion-like networks, thus without the need for any post-hoc method. To achieve this, we define a type of cosine-similarity transform within the parameterized hypercomplex domain. This PHB-cos transform induces weight alignment with relevant input features and allows to reduce the model into a single linear transform, rendering it directly interpretable. In this work, we start to draw insights into how this unique branch of neural models operates. We observe that hypercomplex networks exhibit a tendency to concentrate on the shape around the main object of interest, in addition to the shape of the object itself. We provide a thorough analysis, studying single neurons of different layers and comparing them against how real-valued networks learn. The code of the paper is available at https://github.com/ispamm/HxAI.
comment: The paper has been accepted at IEEE WCCI 2024
☆ FastCAR: Fast Classification And Regression Multi-Task Learning via Task Consolidation for Modelling a Continuous Property Variable of Object Classes
FastCAR is a novel task consolidation approach in Multi-Task Learning (MTL) for a classification and a regression task, despite task heterogeneity with only subtle correlation. It addresses object classification and continuous property variable regression, a crucial use case in science and engineering. FastCAR involves a labeling transformation approach that can be used with a single-task regression network architecture. FastCAR outperforms traditional MTL model families, parametrized in the landscape of architecture and loss weighting schemes, when learning of both tasks are collectively considered (classification accuracy of 99.54%, regression mean absolute percentage error of 2.3%). The experiments performed used an Advanced Steel Property dataset contributed by us. The dataset comprises 4536 images of 224x224 pixels, annotated with object classes and hardness properties that take continuous values. With the labeling transformation and single-task regression network architecture, FastCAR achieves reduced latency and time efficiency.
☆ AID: Attention Interpolation of Text-to-Image Diffusion
Conditional diffusion models can create unseen images in various settings, aiding image interpolation. Interpolation in latent spaces is well-studied, but interpolation with specific conditions like text or poses is less understood. Simple approaches, such as linear interpolation in the space of conditions, often result in images that lack consistency, smoothness, and fidelity. To that end, we introduce a novel training-free technique named Attention Interpolation via Diffusion (AID). Our key contributions include 1) proposing an inner/outer interpolated attention layer; 2) fusing the interpolated attention with self-attention to boost fidelity; and 3) applying beta distribution to selection to increase smoothness. We also present a variant, Prompt-guided Attention Interpolation via Diffusion (PAID), that considers interpolation as a condition-dependent generative process. This method enables the creation of new images with greater consistency, smoothness, and efficiency, and offers control over the exact path of interpolation. Our approach demonstrates effectiveness for conceptual and spatial interpolation. Code and demo are available at https://github.com/QY-H00/attention-interpolation-diffusion.
☆ TC4D: Trajectory-Conditioned Text-to-4D Generation
Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate-they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene's bounding box using rigid transformation along a trajectory parameterized by a spline. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study. Video results can be viewed on our website: https://sherwinbahmani.github.io/tc4d.
comment: Project Page: https://sherwinbahmani.github.io/tc4d
☆ CMP: Cooperative Motion Prediction with Multi-Agent Communication
The confluence of the advancement of Autonomous Vehicles (AVs) and the maturity of Vehicle-to-Everything (V2X) communication has enabled the capability of cooperative connected and automated vehicles (CAVs). Building on top of cooperative perception, this paper explores the feasibility and effectiveness of cooperative motion prediction. Our method, CMP, takes LiDAR signals as input to enhance tracking and prediction capabilities. Unlike previous work that focuses separately on either cooperative perception or motion prediction, our framework, to the best of our knowledge, is the first to address the unified problem where CAVs share information in both perception and prediction modules. Incorporated into our design is the unique capability to tolerate realistic V2X bandwidth limitations and transmission delays, while dealing with bulky perception representations. We also propose a prediction aggregation module, which unifies the predictions obtained by different CAVs and generates the final prediction. Through extensive experiments and ablation studies, we demonstrate the effectiveness of our method in cooperative perception, tracking, and motion prediction tasks. In particular, CMP reduces the average prediction error by 17.2\% with fewer missing detections compared with the no cooperation setting. Our work marks a significant step forward in the cooperative capabilities of CAVs, showcasing enhanced performance in complex scenarios.
☆ Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos
Monocular depth estimation in endoscopy videos can enable assistive and robotic surgery to obtain better coverage of the organ and detection of various health issues. Despite promising progress on mainstream, natural image depth estimation, techniques perform poorly on endoscopy images due to a lack of strong geometric features and challenging illumination effects. In this paper, we utilize the photometric cues, i.e., the light emitted from an endoscope and reflected by the surface, to improve monocular depth estimation. We first create two novel loss functions with supervised and self-supervised variants that utilize a per-pixel shading representation. We then propose a novel depth refinement network (PPSNet) that leverages the same per-pixel shading representation. Finally, we introduce teacher-student transfer learning to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision. We achieve state-of-the-art results on the C3VD dataset while estimating high-quality depth maps from clinical data. Our code, pre-trained models, and supplementary materials can be found on our project page: https://ppsnet.github.io/
comment: 26 pages, 7 tables, 7 figures
☆ ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection
Deep learning has shown remarkable success in remote sensing change detection (CD), aiming to identify semantic change regions between co-registered satellite image pairs acquired at distinct time stamps. However, existing convolutional neural network and transformer-based frameworks often struggle to accurately segment semantic change regions. Moreover, transformers-based methods with standard self-attention suffer from quadratic computational complexity with respect to the image resolution, making them less practical for CD tasks with limited training data. To address these issues, we propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions while reducing the model size. Our ELGC-Net comprises a Siamese encoder, fusion modules, and a decoder. The focus of our design is the introduction of an Efficient Local-Global Context Aggregator module within the encoder, capturing enhanced global context and local spatial information through a novel pooled-transpose (PT) attention and depthwise convolution, respectively. The PT attention employs pooling operations for robust feature extraction and minimizes computational cost with transposed attention. Extensive experiments on three challenging CD datasets demonstrate that ELGC-Net outperforms existing methods. Compared to the recent transformer-based CD approach (ChangeFormer), ELGC-Net achieves a 1.4% gain in intersection over union metric on the LEVIR-CD dataset, while significantly reducing trainable parameters. Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks. Finally, we also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings, while achieving comparable performance. Project url https://github.com/techmn/elgcnet.
comment: accepted at IEEE TGRS
☆ Scalable Non-Cartesian Magnetic Resonance Imaging with R2D2
We propose a new approach for non-Cartesian magnetic resonance image reconstruction. While unrolled architectures provide robustness via data-consistency layers, embedding measurement operators in Deep Neural Network (DNN) can become impractical at large scale. Alternative Plug-and-Play (PnP) approaches, where the denoising DNNs are blind to the measurement setting, are not affected by this limitation and have also proven effective, but their highly iterative nature also affects scalability. To address this scalability challenge, we leverage the "Residual-to-Residual DNN series for high-Dynamic range imaging (R2D2)" approach recently introduced in astronomical imaging. R2D2's reconstruction is formed as a series of residual images, iteratively estimated as outputs of DNNs taking the previous iteration's image estimate and associated data residual as inputs. The method can be interpreted as a learned version of the Matching Pursuit algorithm. We demonstrate R2D2 in simulation, considering radial k-space sampling acquisition sequences. Our preliminary results suggest that R2D2 achieves: (i) suboptimal performance compared to its unrolled incarnation R2D2-Net, which is however non-scalable due to the necessary embedding of NUFFT-based data-consistency layers; (ii) superior reconstruction quality to a scalable version of R2D2-Net embedding an FFT-based approximation for data consistency; (iii) superior reconstruction quality to PnP, while only requiring few iterations.
comment: submitted to IEEE EUSIPCO 2024
☆ Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models
The landscape of computational building blocks of efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters are inherently local and therefore struggle at modeling long-range dependencies in images. On the other hand, attention excels at capturing global interactions between arbitrary image regions, however at a quadratic cost in image dimension. In this work, we propose Serpent, an architecture that leverages recent advances in state space models (SSMs) in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with a favorable linear scaling in input size. Our preliminary results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques, while requiring orders of magnitude less compute (up to $150$ fold reduction in FLOPS) and a factor of up to $5\times$ less GPU memory while maintaining a compact model size.
comment: 7 pages, 5 figures, preliminary workshop submission of a comprehensive work to be released soon
☆ Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians
The recent 3D Gaussian splatting (3D-GS) has shown remarkable rendering fidelity and efficiency compared to NeRF-based neural scene representations. While demonstrating the potential for real-time rendering, 3D-GS encounters rendering bottlenecks in large scenes with complex details due to an excessive number of Gaussian primitives located within the viewing frustum. This limitation is particularly noticeable in zoom-out views and can lead to inconsistent rendering speeds in scenes with varying details. Moreover, it often struggles to capture the corresponding level of details at different scales with its heuristic density control operation. Inspired by the Level-of-Detail (LOD) techniques, we introduce Octree-GS, featuring an LOD-structured 3D Gaussian approach supporting level-of-detail decomposition for scene representation that contributes to the final rendering results. Our model dynamically selects the appropriate level from the set of multi-resolution anchor points, ensuring consistent rendering performance with adaptive LOD adjustments while maintaining high-fidelity rendering results.
comment: Project page: https://city-super.github.io/octree-gs/
☆ A Survey on 3D Egocentric Human Pose Estimation
Egocentric human pose estimation aims to estimate human body poses and develop body representations from a first-person camera perspective. It has gained vast popularity in recent years because of its wide range of applications in sectors like XR-technologies, human-computer interaction, and fitness tracking. However, to the best of our knowledge, there is no systematic literature review based on the proposed solutions regarding egocentric 3D human pose estimation. To that end, the aim of this survey paper is to provide an extensive overview of the current state of egocentric pose estimation research. In this paper, we categorize and discuss the popular datasets and the different pose estimation models, highlighting the strengths and weaknesses of different methods by comparative analysis. This survey can be a valuable resource for both researchers and practitioners in the field, offering insights into key concepts and cutting-edge solutions in egocentric pose estimation, its wide-ranging applications, as well as the open problems with future scope.
☆ 2D Gaussian Splatting for Geometrically Accurate Radiance Fields
3D Gaussian Splatting (3DGS) has recently revolutionized radiance field reconstruction, achieving high quality novel view synthesis and fast rendering speed without baking. However, 3DGS fails to accurately represent surfaces due to the multi-view inconsistent nature of 3D Gaussians. We present 2D Gaussian Splatting (2DGS), a novel approach to model and reconstruct geometrically accurate radiance fields from multi-view images. Our key idea is to collapse the 3D volume into a set of 2D oriented planar Gaussian disks. Unlike 3D Gaussians, 2D Gaussians provide view-consistent geometry while modeling surfaces intrinsically. To accurately recover thin surfaces and achieve stable optimization, we introduce a perspective-accurate 2D splatting process utilizing ray-splat intersection and rasterization. Additionally, we incorporate depth distortion and normal consistency terms to further enhance the quality of the reconstructions. We demonstrate that our differentiable renderer allows for noise-free and detailed geometry reconstruction while maintaining competitive appearance quality, fast training speed, and real-time rendering. Our code will be made publicly available.
comment: 12 pages, 12 figures
☆ Sen2Fire: A Challenging Benchmark Dataset for Wildfire Detection using Sentinel Data
Utilizing satellite imagery for wildfire detection presents substantial potential for practical applications. To advance the development of machine learning algorithms in this domain, our study introduces the \textit{Sen2Fire} dataset--a challenging satellite remote sensing dataset tailored for wildfire detection. This dataset is curated from Sentinel-2 multi-spectral data and Sentinel-5P aerosol product, comprising a total of 2466 image patches. Each patch has a size of 512$\times$512 pixels with 13 bands. Given the distinctive sensitivities of various wavebands to wildfire responses, our research focuses on optimizing wildfire detection by evaluating different wavebands and employing a combination of spectral indices, such as normalized burn ratio (NBR) and normalized difference vegetation index (NDVI). The results suggest that, in contrast to using all bands for wildfire detection, selecting specific band combinations yields superior performance. Additionally, our study underscores the positive impact of integrating Sentinel-5 aerosol data for wildfire detection. The code and dataset are available online (https://zenodo.org/records/10881058).
☆ Superior and Pragmatic Talking Face Generation with Teacher-Student Framework
Talking face generation technology creates talking videos from arbitrary appearance and motion signal, with the "arbitrary" offering ease of use but also introducing challenges in practical applications. Existing methods work well with standard inputs but suffer serious performance degradation with intricate real-world ones. Moreover, efficiency is also an important concern in deployment. To comprehensively address these issues, we introduce SuperFace, a teacher-student framework that balances quality, robustness, cost and editability. We first propose a simple but effective teacher model capable of handling inputs of varying qualities to generate high-quality results. Building on this, we devise an efficient distillation strategy to acquire an identity-specific student model that maintains quality with significantly reduced computational load. Our experiments validate that SuperFace offers a more comprehensive solution than existing methods for the four mentioned objectives, especially in reducing FLOPs by 99\% with the student model. SuperFace can be driven by both video and audio and allows for localized facial attributes editing.
☆ Deepfake Generation and Detection: A Benchmark and Survey
In addition to the advancements in deepfake generation, corresponding detection technologies need to continuously evolve to regulate the potential misuse of deepfakes, such as for privacy invasion and phishing attacks. This survey comprehensively reviews the latest developments in deepfake generation and detection, summarizing and analyzing the current state of the art in this rapidly evolving field. We first unify task definitions, comprehensively introduce datasets and metrics, and discuss the development of generation and detection technology frameworks. Then, we discuss the development of several related sub-fields and focus on researching four mainstream deepfake fields: popular face swap, face reenactment, talking face generation, and facial attribute editing, as well as foreign detection. Subsequently, we comprehensively benchmark representative methods on popular datasets for each field, fully evaluating the latest and influential works published in top conferences/journals. Finally, we analyze the challenges and future research directions of the discussed fields. We closely follow the latest developments in https://github.com/flyingby/Awesome-Deepfake-Generation-and-Detection.
☆ Low-Latency Neural Stereo Streaming CVPR2024
The rise of new video modalities like virtual reality or autonomous driving has increased the demand for efficient multi-view video compression methods, both in terms of rate-distortion (R-D) performance and in terms of delay and runtime. While most recent stereo video compression approaches have shown promising performance, they compress left and right views sequentially, leading to poor parallelization and runtime performance. This work presents Low-Latency neural codec for Stereo video Streaming (LLSS), a novel parallel stereo video coding method designed for fast and efficient low-latency stereo video streaming. Instead of using a sequential cross-view motion compensation like existing methods, LLSS introduces a bidirectional feature shifting module to directly exploit mutual information among views and encode them effectively with a joint cross-view prior model for entropy coding. Thanks to this design, LLSS processes left and right views in parallel, minimizing latency; all while substantially improving R-D performance compared to both existing neural and conventional codecs.
comment: Accepted by CVPR2024
☆ Boosting Diffusion Models with Moving Average Sampling in Frequency Domain CVPR 2024
Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior samples. Instead of simply applying moving average to the denoised samples at different timesteps, we first map the denoised samples to data space and then perform moving average to avoid distribution shift across timesteps. In view that diffusion models evolve the recovery from low-frequency components to high-frequency details, we further decompose the samples into different frequency components and execute moving average separately on each component. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF could be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that our MASF leads to superior performances compared to the baselines, with almost negligible additional complexity cost.
comment: CVPR 2024
☆ To Supervise or Not to Supervise: Understanding and Addressing the Key Challenges of 3D Transfer Learning
Transfer learning has long been a key factor in the advancement of many fields including 2D image analysis. Unfortunately, its applicability in 3D data processing has been relatively limited. While several approaches for 3D transfer learning have been proposed in recent literature, with contrastive learning gaining particular prominence, most existing methods in this domain have only been studied and evaluated in limited scenarios. Most importantly, there is currently a lack of principled understanding of both when and why 3D transfer learning methods are applicable. Remarkably, even the applicability of standard supervised pre-training is poorly understood. In this work, we conduct the first in-depth quantitative and qualitative investigation of supervised and contrastive pre-training strategies and their utility in downstream 3D tasks. We demonstrate that layer-wise analysis of learned features provides significant insight into the downstream utility of trained networks. Informed by this analysis, we propose a simple geometric regularization strategy, which improves the transferability of supervised pre-training. Our work thus sheds light onto both the specific challenges of 3D transfer learning, as well as strategies to overcome them.
☆ Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-storage environments. We provide code and trial video data at http://hovsg.github.io/.
comment: Code and video are available at http://hovsg.github.io/
☆ ReMamber: Referring Image Segmentation with Mamba Twister
Referring Image Segmentation (RIS) leveraging transformers has achieved great success on the interpretation of complex visual-language tasks. However, the quadratic computation cost makes it resource-consuming in capturing long-range visual-language dependencies. Fortunately, Mamba addresses this with efficient linear complexity in processing. However, directly applying Mamba to multi-modal interactions presents challenges, primarily due to inadequate channel interactions for the effective fusion of multi-modal data. In this paper, we propose ReMamber, a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism. We achieve the state-of-the-art on three challenging benchmarks. Moreover, we conduct thorough analyses of ReMamber and discuss other fusion designs using Mamba. These provide valuable perspectives for future research.
☆ GTA-HDR: A Large-Scale Synthetic Dataset for HDR Image Reconstruction
High Dynamic Range (HDR) content (i.e., images and videos) has a broad range of applications. However, capturing HDR content from real-world scenes is expensive and time- consuming. Therefore, the challenging task of reconstructing visually accurate HDR images from their Low Dynamic Range (LDR) counterparts is gaining attention in the vision research community. A major challenge in this research problem is the lack of datasets, which capture diverse scene conditions (e.g., lighting, shadows, weather, locations, landscapes, objects, humans, buildings) and various image features (e.g., color, contrast, saturation, hue, luminance, brightness, radiance). To address this gap, in this paper, we introduce GTA-HDR, a large-scale synthetic dataset of photo-realistic HDR images sampled from the GTA-V video game. We perform thorough evaluation of the proposed dataset, which demonstrates significant qualitative and quantitative improvements of the state-of-the-art HDR image reconstruction methods. Furthermore, we demonstrate the effectiveness of the proposed dataset and its impact on additional computer vision tasks including 3D human pose estimation, human body part segmentation, and holistic scene segmentation. The dataset, data collection pipeline, and evaluation code are available at: https://github.com/HrishavBakulBarua/GTA-HDR.
comment: Submitted to IEEE
☆ A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities
A major challenge in computational research in 3D medical imaging is the lack of comprehensive datasets. Addressing this issue, our study introduces CT-RATE, the first 3D medical imaging dataset that pairs images with textual reports. CT-RATE consists of 25,692 non-contrast chest CT volumes, expanded to 50,188 through various reconstructions, from 21,304 unique patients, along with corresponding radiology text reports. Leveraging CT-RATE, we developed CT-CLIP, a CT-focused contrastive language-image pre-training framework. As a versatile, self-supervised model, CT-CLIP is designed for broad application and does not require task-specific training. Remarkably, CT-CLIP outperforms state-of-the-art, fully supervised methods in multi-abnormality detection across all key metrics, thus eliminating the need for manual annotation. We also demonstrate its utility in case retrieval, whether using imagery or textual queries, thereby advancing knowledge dissemination. The open-source release of CT-RATE and CT-CLIP marks a significant advancement in medical AI, enhancing 3D imaging analysis and fostering innovation in healthcare.
☆ Assessment of Multimodal Large Language Models in Alignment with Human Values
Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, in terms of Multimodal Large Language Models (MLLMs), despite their commendable performance in perception and reasoning tasks, their alignment with human values remains largely unexplored, given the complexity of defining hhh dimensions in the visual world and the difficulty in collecting relevant data that accurately mirrors real-world situations. To address this gap, we introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations. Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh principle. We also present a unified evaluation strategy supporting assessment across various scenarios and different perspectives. Based on the evaluation results, we summarize over 10 key findings that deepen the understanding of MLLM capabilities, limitations, and the dynamic relationships between evaluation levels, guiding future advancements in the field.
comment: arXiv admin note: text overlap with arXiv:2311.02692
☆ DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. We propose DiffH2O, a novel method to synthesize realistic, one or two-handed object interactions from provided text prompts and geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based interaction stage and use separate diffusion models for each. In the grasping stage, the model only generates hand motions, whereas in the interaction phase both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses. Third, we propose two different guidance schemes to allow more control of the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the interaction phase. For textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to have more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions. Moreover, we demonstrate the practicality of our framework by utilizing a hand pose estimate from an off-the-shelf pose estimator for guidance, and then sampling multiple different actions in the interaction stage.
comment: Project Page: https://diffh2o.github.io/
☆ Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders
Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method specifically differs by exclusively considering pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performances and drastically reducing pre-training time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn objects from motion, but rather thanks to the Siamese architecture. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE.
comment: 19 pages, 6 figures, 3 tables, 1 page of supplementary material
☆ DN-Splatter: Depth and Normal Priors for Gaussian Splatting and Meshing
3D Gaussian splatting, a novel differentiable rendering technique, has achieved state-of-the-art novel view synthesis results with high rendering speeds and relatively low training times. However, its performance on scenes commonly seen in indoor datasets is poor due to the lack of geometric constraints during optimization. We extend 3D Gaussian splatting with depth and normal cues to tackle challenging indoor datasets and showcase techniques for efficient mesh extraction, an important downstream application. Specifically, we regularize the optimization procedure with depth information, enforce local smoothness of nearby Gaussians, and use the geometry of the 3D Gaussians supervised by normal cues to achieve better alignment with the true scene geometry. We improve depth estimation and novel view synthesis results over baselines and show how this simple yet effective regularization technique can be used to directly extract meshes from the Gaussian representation yielding more physically accurate reconstructions on indoor scenes. Our code will be released in https://github.com/maturk/dn-splatter.
☆ Annotated Biomedical Video Generation using Denoising Diffusion Probabilistic Models and Flow Fields
The segmentation and tracking of living cells play a vital role within the biomedical domain, particularly in cancer research, drug development, and developmental biology. These are usually tedious and time-consuming tasks that are traditionally done by biomedical experts. Recently, to automatize these processes, deep learning based segmentation and tracking methods have been proposed. These methods require large-scale datasets and their full potential is constrained by the scarcity of annotated data in the biomedical imaging domain. To address this limitation, we propose Biomedical Video Diffusion Model (BVDM), capable of generating realistic-looking synthetic microscopy videos. Trained only on a single real video, BVDM can generate videos of arbitrary length with pixel-level annotations that can be used for training data-hungry models. It is composed of a denoising diffusion probabilistic model (DDPM) generating high-fidelity synthetic cell microscopy images and a flow prediction model (FPM) predicting the non-rigid transformation between consecutive video frames. During inference, initially, the DDPM imposes realistic cell textures on synthetic cell masks which are generated based on real data statistics. The flow prediction model predicts the flow field between consecutive masks and applies that to the DDPM output from the previous time frame to create the next one while keeping temporal consistency. BVDM outperforms state-of-the-art synthetic live cell microscopy video generation models. Furthermore, we demonstrate that a sufficiently large synthetic dataset enhances the performance of cell segmentation and tracking models compared to using a limited amount of available real data.
☆ Improving Text-to-Image Consistency via Automatic Prompt Optimization
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
☆ Towards 3D Vision with Low-Cost Single-Photon Cameras
We present a method for reconstructing 3D shape of arbitrary Lambertian objects based on measurements by miniature, energy-efficient, low-cost single-photon cameras. These cameras, operating as time resolved image sensors, illuminate the scene with a very fast pulse of diffuse light and record the shape of that pulse as it returns back from the scene at a high temporal resolution. We propose to model this image formation process, account for its non-idealities, and adapt neural rendering to reconstruct 3D geometry from a set of spatially distributed sensors with known poses. We show that our approach can successfully recover complex 3D shapes from simulated data. We further demonstrate 3D object reconstruction from real-world captures, utilizing measurements from a commodity proximity sensor. Our work draws a connection between image-based modeling and active range scanning and is a step towards 3D vision with single-photon cameras.
☆ Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications
The success of Large Language Models (LLMs) has led to a parallel rise in the development of Large Multimodal Models (LMMs), such as Gemini-pro, which have begun to transform a variety of applications. These sophisticated multimodal models are designed to interpret and analyze complex data, integrating both textual and visual information on a scale previously unattainable, opening new avenues for a range of applications. This paper investigates the applicability and effectiveness of prompt-engineered Gemini-pro LMMs versus fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct tasks: a visually evident task of detecting simple triggers, such as small squares in images, indicative of potential backdoors, and a non-visually evident task of malware classification through visual representations. Our results highlight a significant divergence in performance, with Gemini-pro falling short in accuracy and reliability when compared to fine-tuned ViT models. The ViT models, on the other hand, demonstrate exceptional accuracy, achieving near-perfect performance on both tasks. This study not only showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications but also emphasizes the unmatched efficacy of fine-tuned ViT models for precise and dependable tasks.
☆ GenesisTex: Adapting Image Denoising Diffusion to Texture Space
We present GenesisTex, a novel method for synthesizing textures for 3D geometries from text descriptions. GenesisTex adapts the pretrained image diffusion model to texture space by texture space sampling. Specifically, we maintain a latent texture map for each viewpoint, which is updated with predicted noise on the rendering of the corresponding viewpoint. The sampled latent texture maps are then decoded into a final texture map. During the sampling process, we focus on both global and local consistency across multiple viewpoints: global consistency is achieved through the integration of style consistency mechanisms within the noise prediction network, and low-level consistency is achieved by dynamically aligning latent textures. Finally, we apply reference-based inpainting and img2img on denser views for texture refinement. Our approach overcomes the limitations of slow optimization in distillation-based methods and instability in inpainting-based methods. Experiments on meshes from various sources demonstrate that our method surpasses the baseline methods quantitatively and qualitatively.
comment: 12 pages, 10 figures
☆ CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation
Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node generation and the nnU-Net model for lymph node segmentation to improve the segmentation performance of abdominal lymph nodes through synthesizing a diversity of realistic abdominal lymph node data. We propose LN-DDPM, a conditional denoising diffusion probabilistic model (DDPM) for lymph node (LN) generation. LN-DDPM utilizes lymph node masks and anatomical structure masks as model conditions. These conditions work in two conditioning mechanisms: global structure conditioning and local detail conditioning, to distinguish between lymph nodes and their surroundings and better capture lymph node characteristics. The obtained paired abdominal lymph node images and masks are used for the downstream segmentation task. Experimental results on the abdominal lymph node datasets demonstrate that LN-DDPM outperforms other generative methods in the abdominal lymph node image synthesis and better assists the downstream abdominal lymph node segmentation task.
☆ MUTE-SLAM: Real-Time Neural SLAM with Multiple Tri-Plane Hash Representations
We introduce MUTE-SLAM, a real-time neural RGB-D SLAM system employing multiple tri-plane hash-encodings for efficient scene representation. MUTE-SLAM effectively tracks camera positions and incrementally builds a scalable multi-map representation for both small and large indoor environments. It dynamically allocates sub-maps for newly observed local regions, enabling constraint-free mapping without prior scene information. Unlike traditional grid-based methods, we use three orthogonal axis-aligned planes for hash-encoding scene properties, significantly reducing hash collisions and the number of trainable parameters. This hybrid approach not only speeds up convergence but also enhances the fidelity of surface reconstruction. Furthermore, our optimization strategy concurrently optimizes all sub-maps intersecting with the current camera frustum, ensuring global consistency. Extensive testing on both real-world and synthetic datasets has shown that MUTE-SLAM delivers state-of-the-art surface reconstruction quality and competitive tracking performance across diverse indoor settings. The code will be made public upon acceptance of the paper.
☆ Makeup Prior Models for 3D Facial Makeup Estimation and Applications CVPR2024
In this work, we introduce two types of makeup prior models to extend existing 3D face prior models: PCA-based and StyleGAN2-based priors. The PCA-based prior model is a linear model that is easy to construct and is computationally efficient. However, it retains only low-frequency information. Conversely, the StyleGAN2-based model can represent high-frequency information with relatively higher computational cost than the PCA-based model. Although there is a trade-off between the two models, both are applicable to 3D facial makeup estimation and related applications. By leveraging makeup prior models and designing a makeup consistency module, we effectively address the challenges that previous methods faced in robustly estimating makeup, particularly in the context of handling self-occluded faces. In experiments, we demonstrate that our approach reduces computational costs by several orders of magnitude, achieving speeds up to 180 times faster. In addition, by improving the accuracy of the estimated makeup, we confirm that our methods are highly advantageous for various 3D facial makeup applications such as 3D makeup face reconstruction, user-friendly makeup editing, makeup transfer, and interpolation.
comment: CVPR2024. Project: https://yangxingchao.github.io/makeup-priors-page
☆ Noise2Noise Denoising of CRISM Hyperspectral Data ICLR 2024
Hyperspectral data acquired by the Compact Reconnaissance Imaging Spectrometer for Mars (CRISM) have allowed for unparalleled mapping of the surface mineralogy of Mars. Due to sensor degradation over time, a significant portion of the recently acquired data is considered unusable. Here a new data-driven model architecture, Noise2Noise4Mars (N2N4M), is introduced to remove noise from CRISM images. Our model is self-supervised and does not require zero-noise target data, making it well suited for use in Planetary Science applications where high quality labelled data is scarce. We demonstrate its strong performance on synthetic-noise data and CRISM images, and its impact on downstream classification performance, outperforming benchmark methods on most metrics. This allows for detailed analysis for critical sites of interest on the Martian surface, including proposed lander sites.
comment: 5 pages, 3 figures. Accepted as a conference paper at the ICLR 2024 ML4RS Workshop
☆ DataCook: Crafting Anti-Adversarial Examples for Healthcare Data Copyright Protection
In the realm of healthcare, the challenges of copyright protection and unauthorized third-party misuse are increasingly significant. Traditional methods for data copyright protection are applied prior to data distribution, implying that models trained on these data become uncontrollable. This paper introduces a novel approach, named DataCook, designed to safeguard the copyright of healthcare data during the deployment phase. DataCook operates by "cooking" the raw data before distribution, enabling the development of models that perform normally on this processed data. However, during the deployment phase, the original test data must be also "cooked" through DataCook to ensure normal model performance. This process grants copyright holders control over authorization during the deployment phase. The mechanism behind DataCook is by crafting anti-adversarial examples (AntiAdv), which are designed to enhance model confidence, as opposed to standard adversarial examples (Adv) that aim to confuse models. Similar to Adv, AntiAdv introduces imperceptible perturbations, ensuring that the data processed by DataCook remains easily understandable. We conducted extensive experiments on MedMNIST datasets, encompassing both 2D/3D data and the high-resolution variants. The outcomes indicate that DataCook effectively meets its objectives, preventing models trained on AntiAdv from analyzing unauthorized data effectively, without compromising the validity and accuracy of the data in legitimate scenarios. Code and data are available at https://github.com/MedMNIST/DataCook.
☆ Multi-Task Dense Prediction via Mixture of Low-Rank Experts CVPR 2024
Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have received great performance but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper, we present a novel decoder-focused method for multi-task dense prediction, called Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships, MLoRE adds a generic convolution path to the original MoE structure, where each task feature can go through this path for explicit parameter sharing. Furthermore, to control the parameters and computational cost brought by the increase in the number of experts, we take inspiration from LoRA and propose to leverage the low-rank format of a vanilla convolution in the expert network. Since the low-rank experts have fewer parameters and can be dynamically parameterized into the generic convolution, the parameters and computational cost do not change much with the increase of experts. Benefiting from this design, we increase the number of experts and its reception field to enlarge the representation capacity, facilitating multiple dense tasks learning in a unified network. Extensive experiments on the PASCAL-Context and NYUD-v2 benchmarks show that our MLoRE achieves superior performance compared to previous state-of-the-art methods on all metrics. Our code is available at https://github.com/YuqiYang213/MLoRE.
comment: Accepted at CVPR 2024
☆ Paired Diffusion: Generation of related, synthetic PET-CT-Segmentation scans using Linked Denoising Diffusion Probabilistic Models
The rapid advancement of Artificial Intelligence (AI) in biomedical imaging and radiotherapy is hindered by the limited availability of large imaging data repositories. With recent research and improvements in denoising diffusion probabilistic models (DDPM), high quality synthetic medical scans are now possible. Despite this, there is currently no way of generating multiple related images, such as a corresponding ground truth which can be used to train models, so synthetic scans are often manually annotated before use. This research introduces a novel architecture that is able to generate multiple, related PET-CT-tumour mask pairs using paired networks and conditional encoders. Our approach includes innovative, time step-controlled mechanisms and a `noise-seeding' strategy to improve DDPM sampling consistency. While our model requires a modified perceptual loss function to ensure accurate feature alignment we show generation of clearly aligned synthetic images and improvement in segmentation accuracy with generated images.
comment: to be published in IEEE International Symposium on Biomedical Imaging 2024
☆ FastPerson: Enhancing Video Learning through Effective Video Summarization that Preserves Linguistic and Visual Contexts
Quickly understanding lengthy lecture videos is essential for learners with limited time and interest in various topics to improve their learning efficiency. To this end, video summarization has been actively researched to enable users to view only important scenes from a video. However, these studies focus on either the visual or audio information of a video and extract important segments in the video. Therefore, there is a risk of missing important information when both the teacher's speech and visual information on the blackboard or slides are important, such as in a lecture video. To tackle this issue, we propose FastPerson, a video summarization approach that considers both the visual and auditory information in lecture videos. FastPerson creates summary videos by utilizing audio transcriptions along with on-screen images and text, minimizing the risk of overlooking crucial information for learners. Further, it provides a feature that allows learners to switch between the summary and original videos for each chapter of the video, enabling them to adjust the pace of learning based on their interests and level of understanding. We conducted an evaluation with 40 participants to assess the effectiveness of our method and confirmed that it reduced viewing time by 53\% at the same level of comprehension as that when using traditional video playback methods.
☆ Deep Learning for Segmentation of Cracks in High-Resolution Images of Steel Bridges
Automating the current bridge visual inspection practices using drones and image processing techniques is a prominent way to make these inspections more effective, robust, and less expensive. In this paper, we investigate the development of a novel deep-learning method for the detection of fatigue cracks in high-resolution images of steel bridges. First, we present a novel and challenging dataset comprising of images of cracks in steel bridges. Secondly, we integrate the ConvNext neural network with a previous state- of-the-art encoder-decoder network for crack segmentation. We study and report, the effects of the use of background patches on the network performance when applied to high-resolution images of cracks in steel bridges. Finally, we introduce a loss function that allows the use of more background patches for the training process, which yields a significant reduction in false positive rates.
☆ Invisible Gas Detection: An RGB-Thermal Cross Attention Network and A New Benchmark
The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward approach to identify gas leakage areas. However, the development of high-quality algorithms has been challenging due to the low texture in thermal images and the lack of open-source datasets. In this paper, we present the RGB-Thermal Cross Attention Network (RT-CAN), which employs an RGB-assisted two-stream network architecture to integrate texture information from RGB images and gas area information from thermal images. Additionally, to facilitate the research of invisible gas detection, we introduce Gas-DB, an extensive open-source gas detection database including about 1.3K well-annotated RGB-thermal images with eight variant collection scenes. Experimental results demonstrate that our method successfully leverages the advantages of both modalities, achieving state-of-the-art (SOTA) performance among RGB-thermal methods, surpassing single-stream SOTA models in terms of accuracy, Intersection of Union (IoU), and F2 metrics by 4.86%, 5.65%, and 4.88%, respectively. The code and data will be made available soon.
☆ Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection CVPR 2024
Visual Relationship Detection (VRD) has seen significant advancements with Transformer-based architectures recently. However, we identify two key limitations in a conventional label assignment for training Transformer-based VRD models, which is a process of mapping a ground-truth (GT) to a prediction. Under the conventional assignment, an unspecialized query is trained since a query is expected to detect every relation, which makes it difficult for a query to specialize in specific relations. Furthermore, a query is also insufficiently trained since a GT is assigned only to a single prediction, therefore near-correct or even correct predictions are suppressed by being assigned no relation as a GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains a specialized query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates the training by assigning a GT to multiple predictions that are significantly close to a GT in terms of a subject, an object, and the relation in between. Experimental results and analyses show that SpeaQ effectively trains specialized queries, which better utilize the capacity of a model, resulting in consistent performance gains with zero additional inference cost across multiple VRD models and benchmarks. Code is available at https://github.com/mlvlab/SpeaQ.
comment: CVPR 2024
☆ Panonut360: A Head and Eye Tracking Dataset for Panoramic Video ACM MM
With the rapid development and widespread application of VR/AR technology, maximizing the quality of immersive panoramic video services that match users' personal preferences and habits has become a long-standing challenge. Understanding the saliency region where users focus, based on data collected with HMDs, can promote multimedia encoding, transmission, and quality assessment. At the same time, large-scale datasets are essential for researchers and developers to explore short/long-term user behavior patterns and train AI models related to panoramic videos. However, existing panoramic video datasets often include low-frequency user head or eye movement data through short-term videos only, lacking sufficient data for analyzing users' Field of View (FoV) and generating video saliency regions. Driven by these practical factors, in this paper, we present a head and eye tracking dataset involving 50 users (25 males and 25 females) watching 15 panoramic videos. The dataset provides details on the viewport and gaze attention locations of users. Besides, we present some statistics samples extracted from the dataset. For example, the deviation between head and eye movements challenges the widely held assumption that gaze attention decreases from the center of the FoV following a Gaussian distribution. Our analysis reveals a consistent downward offset in gaze fixations relative to the FoV in experimental settings involving multiple users and videos. That's why we name the dataset Panonut, a saliency weighting shaped like a donut. Finally, we also provide a script that generates saliency distributions based on given head or eye coordinates and pre-generated saliency distribution map sets of each video from the collected eye tracking data. The dataset is available on website: https://dianvrlab.github.io/Panonut360/.
comment: 7 pages,ACM MMSys'24 accepted
☆ The Solution for the CVPR 2023 1st foundation model challenge-Track2
In this paper, we propose a solution for cross-modal transportation retrieval. Due to the cross-domain problem of traffic images, we divide the problem into two sub-tasks of pedestrian retrieval and vehicle retrieval through a simple strategy. In pedestrian retrieval tasks, we use IRRA as the base model and specifically design an Attribute Classification to mine the knowledge implied by attribute labels. More importantly, We use the strategy of Inclusion Relation Matching to make the image-text pairs with inclusion relation have similar representation in the feature space. For the vehicle retrieval task, we use BLIP as the base model. Since aligning the color attributes of vehicles is challenging, we introduce attribute-based object detection techniques to add color patch blocks to vehicle images for color data augmentation. This serves as strong prior information, helping the model perform the image-text alignment. At the same time, we incorporate labeled attributes into the image-text alignment loss to learn fine-grained alignment and prevent similar images and texts from being incorrectly separated. Our approach ranked first in the final B-board test with a score of 70.9.
☆ Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation
Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its variants, have demonstrated notable performance in the field of vision. However, their feature extraction methods may not be sufficiently effective and retain some redundant structures, leaving room for parameter reduction. Motivated by previous spatial and channel attention methods, we propose Triplet Mamba-UNet. The method leverages residual VSS Blocks to extract intensive contextual features, while Triplet SSM is employed to fuse features across spatial and channel dimensions. We conducted experiments on ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir-SEG, CVC-ColonDB, and Kvasir-Instrument datasets, demonstrating the superior segmentation performance of our proposed TM-UNet. Additionally, compared to the previous VM-UNet, our model achieves a one-third reduction in parameters.
☆ PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition
We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at https://github.com/ChenhongyiYang/PlainMamba
☆ AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation
In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we extract 3D intermediate representations from audio and project them into a sequence of 2D facial landmarks. Subsequently, we employ a robust diffusion model, coupled with a motion module, to convert the landmark sequence into photorealistic and temporally consistent portrait animation. Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality, thereby offering an enhanced perceptual experience. Moreover, our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment. We release code and model weights at https://github.com/scutzzj/AniPortrait
☆ Manifold-Guided Lyapunov Control with Diffusion Models
This paper presents a novel approach to generating stabilizing controllers for a large class of dynamical systems using diffusion models. The core objective is to develop stabilizing control functions by identifying the closest asymptotically stable vector field relative to a predetermined manifold and adjusting the control function based on this finding. To achieve this, we employ a diffusion model trained on pairs consisting of asymptotically stable vector fields and their corresponding Lyapunov functions. Our numerical results demonstrate that this pre-trained model can achieve stabilization over previously unseen systems efficiently and rapidly, showcasing the potential of our approach in fast zero-shot control and generalizability.
comment: 14 pages
☆ Not All Similarities Are Created Equal: Leveraging Data-Driven Biases to Inform GenAI Copyright Disputes
The advent of Generative Artificial Intelligence (GenAI) models, including GitHub Copilot, OpenAI GPT, and Stable Diffusion, has revolutionized content creation, enabling non-professionals to produce high-quality content across various domains. This transformative technology has led to a surge of synthetic content and sparked legal disputes over copyright infringement. To address these challenges, this paper introduces a novel approach that leverages the learning capacity of GenAI models for copyright legal analysis, demonstrated with GPT2 and Stable Diffusion models. Copyright law distinguishes between original expressions and generic ones (Sc\`enes \`a faire), protecting the former and permitting reproduction of the latter. However, this distinction has historically been challenging to make consistently, leading to over-protection of copyrighted works. GenAI offers an unprecedented opportunity to enhance this legal analysis by revealing shared patterns in preexisting works. We propose a data-driven approach to identify the genericity of works created by GenAI, employing "data-driven bias" to assess the genericity of expressive compositions. This approach aids in copyright scope determination by utilizing the capabilities of GenAI to identify and prioritize expressive elements and rank them according to their frequency in the model's dataset. The potential implications of measuring expressive genericity for copyright law are profound. Such scoring could assist courts in determining copyright scope during litigation, inform the registration practices of Copyright Offices, allowing registration of only highly original synthetic works, and help copyright owners signal the value of their works and facilitate fairer licensing deals. More generally, this approach offers valuable insights to policymakers grappling with adapting copyright law to the challenges posed by the era of GenAI.
comment: Presented at ACM CSLAW 2024
☆ Hierarchical Light Transformer Ensembles for Multimodal Trajectory Forecasting
Accurate trajectory forecasting is crucial for the performance of various systems, such as advanced driver-assistance systems and self-driving vehicles. These forecasts allow to anticipate events leading to collisions and, therefore, to mitigate them. Deep Neural Networks have excelled in motion forecasting, but issues like overconfidence and uncertainty quantification persist. Deep Ensembles address these concerns, yet applying them to multimodal distributions remains challenging. In this paper, we propose a novel approach named Hierarchical Light Transformer Ensembles (HLT-Ens), aimed at efficiently training an ensemble of Transformer architectures using a novel hierarchical loss function. HLT-Ens leverages grouped fully connected layers, inspired by grouped convolution techniques, to capture multimodal distributions, effectively. Through extensive experimentation, we demonstrate that HLT-Ens achieves state-of-the-art performance levels, offering a promising avenue for improving trajectory forecasting techniques.
☆ Predicting Perceived Gloss: Do Weak Labels Suffice?
Estimating perceptual attributes of materials directly from images is a challenging task due to their complex, not fully-understood interactions with external factors, such as geometry and lighting. Supervised deep learning models have recently been shown to outperform traditional approaches, but rely on large datasets of human-annotated images for accurate perception predictions. Obtaining reliable annotations is a costly endeavor, aggravated by the limited ability of these models to generalise to different aspects of appearance. In this work, we show how a much smaller set of human annotations ("strong labels") can be effectively augmented with automatically derived "weak labels" in the context of learning a low-dimensional image-computable gloss metric. We evaluate three alternative weak labels for predicting human gloss perception from limited annotated data. Incorporating weak labels enhances our gloss prediction beyond the current state of the art. Moreover, it enables a substantial reduction in human annotation costs without sacrificing accuracy, whether working with rendered images or real photographs.
comment: Computer Graphics Forum (Eurographics 2024)
☆ DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation
Facial Appearance Editing (FAE) aims to modify physical attributes, such as pose, expression and lighting, of human facial images while preserving attributes like identity and background, showing great importance in photograph. In spite of the great progress in this area, current researches generally meet three challenges: low generation fidelity, poor attribute preservation, and inefficient inference. To overcome above challenges, this paper presents DiffFAE, a one-stage and highly-efficient diffusion-based framework tailored for high-fidelity FAE. For high-fidelity query attributes transfer, we adopt Space-sensitive Physical Customization (SPC), which ensures the fidelity and generalization ability by utilizing rendering texture derived from 3D Morphable Model (3DMM). In order to preserve source attributes, we introduce the Region-responsive Semantic Composition (RSC). This module is guided to learn decoupled source-regarding features, thereby better preserving the identity and alleviating artifacts from non-facial attributes such as hair, clothes, and background. We further introduce a consistency regularization for our pipeline to enhance editing controllability by leveraging prior knowledge in the attention matrices of diffusion model. Extensive experiments demonstrate the superiority of DiffFAE over existing methods, achieving state-of-the-art performance in facial appearance editing.
☆ Exploring Dynamic Transformer for Efficient Object Tracking
The speed-precision trade-off is a critical problem for visual object tracking which usually requires low latency and deployment on constrained resources. Existing solutions for efficient tracking mainly focus on adopting light-weight backbones or modules, which nevertheless come at the cost of a sacrifice in precision. In this paper, inspired by dynamic network routing, we propose DyTrack, a dynamic transformer framework for efficient tracking. Real-world tracking scenarios exhibit diverse levels of complexity. We argue that a simple network is sufficient for easy frames in video sequences, while more computation could be assigned to difficult ones. DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget. Thus, it can achieve higher performance with the same running speed. We formulate instance-specific tracking as a sequential decision problem and attach terminating branches to intermediate layers of the entire model. Especially, to fully utilize the computations, we introduce the feature recycling mechanism to reuse the outputs of predecessors. Furthermore, a target-aware self-distillation strategy is designed to enhance the discriminating capabilities of early predictions by effectively mimicking the representation pattern of the deep model. Extensive experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model. For instance, DyTrack obtains 64.9% AUC on LaSOT with a speed of 256 fps.
☆ High-Resolution Image Translation Model Based on Grayscale Redefinition
Image-to-image translation is a technique that focuses on transferring images from one domain to another while maintaining the essential content representations. In recent years, image-to-image translation has gained significant attention and achieved remarkable advancements due to its diverse applications in computer vision and image processing tasks. In this work, we propose an innovative method for image translation between different domains. For high-resolution image translation tasks, we use a grayscale adjustment method to achieve pixel-level translation. For other tasks, we utilize the Pix2PixHD model with a coarse-to-fine generator, multi-scale discriminator, and improved loss to enhance the image translation performance. On the other hand, to tackle the issue of sparse training data, we adopt model weight initialization from other task to optimize the performance of the current task.
☆ Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency CVPR 2024
We propose a voxel-based optimization framework, ReVoRF, for few-shot radiance fields that strategically address the unreliability in pseudo novel view synthesis. Our method pivots on the insight that relative depth relationships within neighboring regions are more reliable than the absolute color values in disoccluded areas. Consequently, we devise a bilateral geometric consistency loss that carefully navigates the trade-off between color fidelity and geometric accuracy in the context of depth consistency for uncertain regions. Moreover, we present a reliability-guided learning strategy to discern and utilize the variable quality across synthesized views, complemented by a reliability-aware voxel smoothing algorithm that smoothens the transition between reliable and unreliable data patches. Our approach allows for a more nuanced use of all available data, promoting enhanced learning from regions previously considered unsuitable for high-quality reconstruction. Extensive experiments across diverse datasets reveal that our approach attains significant gains in efficiency and accuracy, delivering rendering speeds of 3 FPS, 7 mins to train a $360^\circ$ scene, and a 5\% improvement in PSNR over existing few-shot methods. Code is available at https://github.com/HKCLynn/ReVoRF.
comment: CVPR 2024 final version
☆ UADA3D: Unsupervised Adversarial Domain Adaptation for 3D Object Detection with Sparse LiDAR and Large Domain Gaps
In this study, we address a gap in existing unsupervised domain adaptation approaches on LiDAR-based 3D object detection, which have predominantly concentrated on adapting between established, high-density autonomous driving datasets. We focus on sparser point clouds, capturing scenarios from different perspectives: not just from vehicles on the road but also from mobile robots on sidewalks, which encounter significantly different environmental conditions and sensor configurations. We introduce Unsupervised Adversarial Domain Adaptation for 3D Object Detection (UADA3D). UADA3D does not depend on pre-trained source models or teacher-student architectures. Instead, it uses an adversarial approach to directly learn domain-invariant features. We demonstrate its efficacy in various adaptation scenarios, showing significant improvements in both self-driving car and mobile robot domains. Our code is open-source and will be available soon.
☆ AniArtAvatar: Animatable 3D Art Avatar from a Single Image
We present a novel approach for generating animatable 3D-aware art avatars from a single image, with controllable facial expressions, head poses, and shoulder movements. Unlike previous reenactment methods, our approach utilizes a view-conditioned 2D diffusion model to synthesize multi-view images from a single art portrait with a neutral expression. With the generated colors and normals, we synthesize a static avatar using an SDF-based neural surface. For avatar animation, we extract control points, transfer the motion with these points, and deform the implicit canonical space. Firstly, we render the front image of the avatar, extract the 2D landmarks, and project them to the 3D space using a trained SDF network. We extract 3D driving landmarks using 3DMM and transfer the motion to the avatar landmarks. To animate the avatar pose, we manually set the body height and bound the head and torso of an avatar with two cages. The head and torso can be animated by transforming the two cages. Our approach is a one-shot pipeline that can be applied to various styles. Experiments demonstrate that our method can generate high-quality 3D art avatars with desired control over different motions.
☆ Grad-CAMO: Learning Interpretable Single-Cell Morphological Profiles from 3D Cell Painting Images
Despite their black-box nature, deep learning models are extensively used in image-based drug discovery to extract feature vectors from single cells in microscopy images. To better understand how these networks perform representation learning, we employ visual explainability techniques (e.g., Grad-CAM). Our analyses reveal several mechanisms by which supervised models cheat, exploiting biologically irrelevant pixels when extracting morphological features from images, such as noise in the background. This raises doubts regarding the fidelity of learned single-cell representations and their relevance when investigating downstream biological questions. To address this misalignment between researcher expectations and machine behavior, we introduce Grad-CAMO, a novel single-cell interpretability score for supervised feature extractors. Grad-CAMO measures the proportion of a model's attention that is concentrated on the cell of interest versus the background. This metric can be assessed per-cell or averaged across a validation set, offering a tool to audit individual features vectors or guide the improved design of deep learning architectures. Importantly, Grad-CAMO seamlessly integrates into existing workflows, requiring no dataset or model modifications, and is compatible with both 2D and 3D Cell Painting data. Additional results are available at https://github.com/eigenvivek/Grad-CAMO.
☆ MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors CVPR2024
Foot contact is an important cue not only for human motion capture but also for motion understanding and physically plausible motion generation. However, most of the foot-contact annotations in existing datasets are estimated by purely visual matching and distance thresholding, which results in low accuracy and coarse granularity. Even though existing multimodal datasets synergistically capture plantar pressure (foot contact) and visual signals, they are specifically designed for small-range and slow motion such as Taiji Quan and Yoga. Therefore, there is still a lack of a vision-pressure multimodal dataset with large-range and fast human motion, as well as accurate and dense foot-contact annotation. To fill this gap, we propose a Multimodal MoCap Dataset with Vision and Pressure sensors, named MMVP. MMVP provides accurate and dense plantar pressure signals synchronized with RGBD observations, which is especially useful for both plausible shape estimation, robust pose fitting without foot drifting, and accurate global translation tracking. To validate the dataset, we propose an RGBD-P SMPL fitting method and also a monocular-video-based baseline framework, VP-MoCap, for human motion capture. Experiments demonstrate that our RGBD-P SMPL Fitting results significantly outperform pure visual motion capture. Moreover, VP-MoCap outperforms SOTA methods in foot-contact and global translation estimation accuracy. We believe the configuration of the dataset and the baseline frameworks will stimulate the research in this direction and also provide a good reference for MoCap applications in various domains. Project page: https://haolyuan.github.io/MMVP-Dataset/.
comment: CVPR2024
☆ Fake or JPEG? Revealing Common Biases in Generated Image Detection Datasets
The widespread adoption of generative image models has highlighted the urgent need to detect artificial content, which is a crucial step in combating widespread manipulation and misinformation. Consequently, numerous detectors and associated datasets have emerged. However, many of these datasets inadvertently introduce undesirable biases, thereby impacting the effectiveness and evaluation of detectors. In this paper, we emphasize that many datasets for AI-generated image detection contain biases related to JPEG compression and image size. Using the GenImage dataset, we demonstrate that detectors indeed learn from these undesired factors. Furthermore, we show that removing the named biases substantially increases robustness to JPEG compression and significantly alters the cross-generator performance of evaluated detectors. Specifically, it leads to more than 11 percentage points increase in cross-generator performance for ResNet50 and Swin-T detectors on the GenImage dataset, achieving state-of-the-art results. We provide the dataset and source codes of this paper on the anonymous website: https://www.unbiased-genimage.org
♻ ☆ DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields
Advances in neural fields are enabling high-fidelity capture of the shape and appearance of dynamic 3D scenes. However, their capabilities lag behind those offered by conventional representations such as 2D videos because of algorithmic challenges and the lack of large-scale multi-view real-world datasets. We address the dataset limitation with DiVa-360, a real-world 360 dynamic visual dataset that contains synchronized high-resolution and long-duration multi-view video sequences of table-scale scenes captured using a customized low-cost system with 53 cameras. It contains 21 object-centric sequences categorized by different motion types, 25 intricate hand-object interaction sequences, and 8 long-duration sequences for a total of 17.4 M image frames. In addition, we provide foreground-background segmentation masks, synchronized audio, and text descriptions. We benchmark the state-of-the-art dynamic neural field methods on DiVa-360 and provide insights about existing methods and future challenges on long-duration neural field capture.
♻ ☆ HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative CVPR 2024
Vehicle-to-everything (V2X) is a popular topic in the field of Autonomous Driving in recent years. Vehicle-infrastructure cooperation (VIC) becomes one of the important research area. Due to the complexity of traffic conditions such as blind spots and occlusion, it greatly limits the perception capabilities of single-view roadside sensing systems. To further enhance the accuracy of roadside perception and provide better information to the vehicle side, in this paper, we constructed holographic intersections with various layouts to build a large-scale multi-sensor holographic vehicle-infrastructure cooperation dataset, called HoloVIC. Our dataset includes 3 different types of sensors (Camera, Lidar, Fisheye) and employs 4 sensor-layouts based on the different intersections. Each intersection is equipped with 6-18 sensors to capture synchronous data. While autonomous vehicles pass through these intersections for collecting VIC data. HoloVIC contains in total on 100k+ synchronous frames from different sensors. Additionally, we annotated 3D bounding boxes based on Camera, Fisheye, and Lidar. We also associate the IDs of the same objects across different devices and consecutive frames in sequence. Based on HoloVIC, we formulated four tasks to facilitate the development of related research. We also provide benchmarks for these tasks.
comment: Accept to CVPR 2024, Benchmark Website: https://holovic.net
♻ ☆ TRIPS: Trilinear Point Splatting for Real-Time Radiance Field Rendering
Point-based radiance field rendering has demonstrated impressive results for novel view synthesis, offering a compelling blend of rendering quality and computational efficiency. However, also latest approaches in this domain are not without their shortcomings. 3D Gaussian Splatting [Kerbl and Kopanas et al. 2023] struggles when tasked with rendering highly detailed scenes, due to blurring and cloudy artifacts. On the other hand, ADOP [R\"uckert et al. 2022] can accommodate crisper images, but the neural reconstruction network decreases performance, it grapples with temporal instability and it is unable to effectively address large gaps in the point cloud. In this paper, we present TRIPS (Trilinear Point Splatting), an approach that combines ideas from both Gaussian Splatting and ADOP. The fundamental concept behind our novel technique involves rasterizing points into a screen-space image pyramid, with the selection of the pyramid layer determined by the projected point size. This approach allows rendering arbitrarily large points using a single trilinear write. A lightweight neural network is then used to reconstruct a hole-free image including detail beyond splat resolution. Importantly, our render pipeline is entirely differentiable, allowing for automatic optimization of both point sizes and positions. Our evaluation demonstrate that TRIPS surpasses existing state-of-the-art methods in terms of rendering quality while maintaining a real-time frame rate of 60 frames per second on readily available hardware. This performance extends to challenging scenarios, such as scenes featuring intricate geometry, expansive landscapes, and auto-exposed footage. The project page is located at: https://lfranke.github.io/trips/
♻ ☆ Semi-Supervised Crowd Counting from Unlabeled Data
Automatic Crowd behavior analysis can be applied to effectively help the daily transportation statistics and planning, which helps the smart city construction. As one of the most important keys, crowd counting has drawn increasing attention. Recent works achieved promising performance but relied on the supervised paradigm with expensive crowd annotations. To alleviate the annotation cost in real-world transportation scenarios, in this work we proposed a semi-supervised learning framework $S^{4}\textit{Crowd}$, which can leverage both unlabeled/labeled data for robust crowd counting. In the unsupervised pathway, two \textit{self-supervised losses} were proposed to simulate the crowd variations such as scale, illumination, based on which supervised information pseudo labels were generated and gradually refined. We also proposed a crowd-driven recurrent unit \textit{Gated-Crowd-Recurrent-Unit (GCRU)}, which can preserve discriminant crowd information by extracting second-order statistics, yielding pseudo labels with improved quality. A joint loss including both unsupervised/supervised information was proposed, and a dynamic weighting strategy was employed to balance the importance of the unsupervised loss and supervised loss at different training stages. We conducted extensive experiments on four popular crowd counting datasets in semi-supervised settings. Experimental results supported the effectiveness of each proposed component in our $S^{4}$Crowd framework. Our method achieved competitive performance in semi-supervised learning approaches on these crowd counting datasets.
♻ ☆ Efficient Pre-training for Localized Instruction Generation of Videos
Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than current web-scale datasets, enables efficient training of large-scale models with competitive performance. We complement our Sieve-\&-Swap approach with a Procedure Transformer (ProcX) for end-to-end step localization and instruction generation for procedural videos. When this model is pre-trained on our curated dataset, it achieves state-of-the-art performance in zero-shot and finetuning settings on YouCook2 and Tasty, while using a fraction of the computational resources.
comment: This version has some missing experiments and elaborative technical details
♻ ☆ SimLVSeg: Simplifying Left Ventricular Segmentation in 2D+Time Echocardiograms with Self- and Weakly-Supervised Learning
Echocardiography has become an indispensable clinical imaging modality for general heart health assessment. From calculating biomarkers such as ejection fraction to the probability of a patient's heart failure, accurate segmentation of the heart structures allows doctors to assess the heart's condition and devise treatments with greater precision and accuracy. However, achieving accurate and reliable left ventricle segmentation is time-consuming and challenging due to different reasons. Hence, clinicians often rely on segmenting the left ventricular (LV) in two specific echocardiogram frames to make a diagnosis. This limited coverage in manual LV segmentation poses a challenge for developing automatic LV segmentation with high temporal consistency, as the resulting dataset is typically annotated sparsely. In response to this challenge, this work introduces SimLVSeg, a novel paradigm that enables video-based networks for consistent LV segmentation from sparsely annotated echocardiogram videos. SimLVSeg consists of self-supervised pre-training with temporal masking, followed by weakly supervised learning tailored for LV segmentation from sparse annotations. We demonstrate how SimLVSeg outperforms the state-of-the-art solutions by achieving a 93.32% (95%CI 93.21-93.43%) dice score on the largest 2D+time echocardiography dataset (EchoNet-Dynamic) while being more efficient. SimLVSeg is compatible with two types of video segmentation networks: 2D super image and 3D segmentation. To show the effectiveness of our approach, we provide extensive ablation studies, including pre-training settings and various deep learning backbones. We further conduct an out-of-distribution test to showcase SimLVSeg's generalizability on unseen distribution (CAMUS dataset). The code is publicly available at https://github.com/fadamsyah/SimLVSeg.
♻ ☆ HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction CVPR 2024
Vectorized High-Definition (HD) map construction requires predictions of the category and point coordinates of map elements (e.g. road boundary, lane divider, pedestrian crossing, etc.). State-of-the-art methods are mainly based on point-level representation learning for regressing accurate point coordinates. However, this pipeline has limitations in obtaining element-level information and handling element-level failures, e.g. erroneous element shape or entanglement between elements. To tackle the above issues, we propose a simple yet effective HybrId framework named HIMap to sufficiently learn and interact both point-level and element-level information. Concretely, we introduce a hybrid representation called HIQuery to represent all map elements, and propose a point-element interactor to interactively extract and encode the hybrid information of elements, e.g. point position and element shape, into the HIQuery. Additionally, we present a point-element consistency constraint to enhance the consistency between the point-level and element-level information. Finally, the output point-element integrated HIQuery can be directly converted into map elements' class, point coordinates, and mask. We conduct extensive experiments and consistently outperform previous methods on both nuScenes and Argoverse2 datasets. Notably, our method achieves $77.8$ mAP on the nuScenes dataset, remarkably superior to previous SOTAs by $8.3$ mAP at least.
comment: Accepted to CVPR 2024
♻ ☆ Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions. Current methods fall short of accurately identifying and mitigating these hallucinations. To address this issue, we introduce ESREAL, a novel unsupervised learning framework designed to suppress the generation of hallucinations through accurate localization and penalization of hallucinated tokens. Initially, ESREAL creates a reconstructed image based on the generated caption and aligns its corresponding regions with those of the original image. This semantic reconstruction aids in identifying both the presence and type of token-level hallucinations within the generated caption. Subsequently, ESREAL computes token-level hallucination scores by assessing the semantic similarity of aligned regions based on the type of hallucination. Finally, ESREAL employs a proximal policy optimization algorithm, where it selectively penalizes hallucinated tokens according to their token-level hallucination scores. Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46% on the CHAIR metric. This improvement is achieved solely through signals derived from the image itself, without the need for any image-text pairs.
♻ ☆ Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously. Firstly, we leverage an ensemble of publicly available 3D datasets to facilitate the training of large-scale models. It consists of a comprehensive collection of approximately 900,000 objects, with multiple properties of meshes, points, voxels, rendered images, and text captions. This diverse labeled dataset, termed Objaverse-Mix, empowers our model to learn from a wide range of object variations. However, directly applying 3D auto-regression encounters critical challenges of high computational demands on volumetric grids and ambiguous auto-regressive order along grid dimensions, resulting in inferior quality of 3D shapes. To this end, we then present a novel framework Argus3D in terms of capacity. Concretely, our approach introduces discrete representation learning based on a latent vector instead of volumetric grids, which not only reduces computational costs but also preserves essential geometric details by learning the joint distributions in a more tractable order. The capacity of conditional generation can thus be realized by simply concatenating various conditioning inputs to the latent vector, such as point clouds, categories, images, and texts. In addition, thanks to the simplicity of our model architecture, we naturally scale up our approach to a larger model with an impressive 3.6 billion parameters, further enhancing the quality of versatile 3D generation. Extensive experiments on four generation tasks demonstrate that Argus3D can synthesize diverse and faithful shapes across multiple categories, achieving remarkable performance.
comment: Project page: https://argus-3d.github.io/ . Datasets: https://huggingface.co/datasets/BAAI/Objaverse-MIX. arXiv admin note: substantial text overlap with arXiv:2303.14700
♻ ☆ ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions
Current approaches for 3D human motion synthesis generate high-quality animations of digital humans performing a wide variety of actions and gestures. However, a notable technological gap exists in addressing the complex dynamics of multi-human interactions within this paradigm. In this work, we present ReMoS, a denoising diffusion-based model that synthesizes full-body reactive motion of a person in a two-person interaction scenario. Assuming the motion of one person is given, we employ a combined spatio-temporal cross-attention mechanism to synthesize the reactive body and hand motion of the second person, thereby completing the interactions between the two. We demonstrate ReMoS across challenging two-person scenarios such as pair-dancing, Ninjutsu, kickboxing, and acrobatics, where one person's movements have complex and diverse influences on the other. We also contribute the ReMoCap dataset for two-person interactions containing full-body and finger motions. We evaluate ReMoS through multiple quantitative metrics, qualitative visualizations, and a user study, and also indicate usability in interactive motion editing applications.
comment: 17 pages, 7 figures, 5 tables
♻ ☆ MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis
Chest X-ray images are commonly used for predicting acute and chronic cardiopulmonary conditions, but efforts to integrate them with structured clinical data face challenges due to incomplete electronic health records (EHR). This paper introduces \textbf{MedPromptX}, the first model to integrate multimodal large language models (MLLMs), few-shot prompting (FP) and visual grounding (VG) to combine imagery with EHR data for chest X-ray diagnosis. A pre-trained MLLM is utilized to complement the missing EHR information, providing a comprehensive understanding of patients' medical history. Additionally, FP reduces the necessity for extensive training of MLLMs while effectively tackling the issue of hallucination. Nevertheless, the process of determining the optimal number of few-shot examples and selecting high-quality candidates can be burdensome, yet it profoundly influences model performance. Hence, we propose a new technique that dynamically refines few-shot data for real-time adjustment to new patient scenarios. Moreover, VG aids in focusing the model's attention on relevant regions of interest in X-ray images, enhancing the identification of abnormalities. We release MedPromptX-VQA, a new in-context visual question answering dataset encompassing interleaved image and EHR data derived from MIMIC-IV and MIMIC-CXR databases. Results demonstrate the SOTA performance of MedPromptX, achieving an 11% improvement in F1-score compared to the baselines. Code and data are available at https://github.com/BioMedIA-MBZUAI/MedPromptX
♻ ☆ Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation CVPR 2024
We propose a text-guided variational image generation method to address the challenge of getting clean data for anomaly detection in industrial manufacturing. Our method utilizes text information about the target object, learned from extensive text library documents, to generate non-defective data images resembling the input image. The proposed framework ensures that the generated non-defective images align with anticipated distributions derived from textual and image-based knowledge, ensuring stability and generality. Experimental results demonstrate the effectiveness of our approach, surpassing previous methods even with limited non-defective data. Our approach is validated through generalization tests across four baseline models and three distinct datasets. We present an additional analysis to enhance the effectiveness of anomaly detection models by utilizing the generated images.
comment: 18 pages, Accepted to CVPR 2024
♻ ☆ Identity-aware Dual-constraint Network for Cloth-Changing Person Re-identification
Cloth-Changing Person Re-Identification (CC-ReID) aims to accurately identify the target person in more realistic surveillance scenarios, where pedestrians usually change their clothing. Despite great progress, limited cloth-changing training samples in existing CC-ReID datasets still prevent the model from adequately learning cloth-irrelevant features. In addition, due to the absence of explicit supervision to keep the model constantly focused on cloth-irrelevant areas, existing methods are still hampered by the disruption of clothing variations. To solve the above issues, we propose an Identity-aware Dual-constraint Network (IDNet) for the CC-ReID task. Specifically, to help the model extract cloth-irrelevant clues, we propose a Clothes Diversity Augmentation (CDA), which generates more realistic cloth-changing samples by enriching the clothing color while preserving the texture. In addition, a Multi-scale Constraint Block (MCB) is designed, which extracts fine-grained identity-related features and effectively transfers cloth-irrelevant knowledge. Moreover, a Counterfactual-guided Attention Module (CAM) is presented, which learns cloth-irrelevant features from channel and space dimensions and utilizes the counterfactual intervention for supervising the attention map to highlight identity-related regions. Finally, a Semantic Alignment Constraint (SAC) is designed to facilitate high-level semantic feature interaction. Comprehensive experiments on four CC-ReID datasets indicate that our method outperforms prior state-of-the-art approaches.
♻ ☆ Unveiling the Pitfalls of Knowledge Editing for Large Language Models ICLR 2024
As the cost associated with fine-tuning Large Language Models (LLMs) continues to rise, recent research efforts have pivoted towards developing methodologies to edit implicit knowledge embedded within LLMs. Yet, there's still a dark cloud lingering overhead -- will knowledge editing trigger butterfly effect? since it is still unclear whether knowledge editing might introduce side effects that pose potential risks or not. This paper pioneers the investigation into the potential pitfalls associated with knowledge editing for LLMs. To achieve this, we introduce new benchmark datasets and propose innovative evaluation metrics. Our results underline two pivotal concerns: (1) Knowledge Conflict: Editing groups of facts that logically clash can magnify the inherent inconsistencies in LLMs-a facet neglected by previous methods. (2) Knowledge Distortion: Altering parameters with the aim of editing factual knowledge can irrevocably warp the innate knowledge structure of LLMs. Experimental results vividly demonstrate that knowledge editing might inadvertently cast a shadow of unintended consequences on LLMs, which warrant attention and efforts for future works. Code and data are available at https://github.com/zjunlp/PitfallsKnowledgeEditing.
comment: ICLR 2024
♻ ☆ Generative 3D Part Assembly via Part-Whole-Hierarchy Message Passing
Generative 3D part assembly involves understanding part relationships and predicting their 6-DoF poses for assembling a realistic 3D shape. Prior work often focus on the geometry of individual parts, neglecting part-whole hierarchies of objects. Leveraging two key observations: 1) super-part poses provide strong hints about part poses, and 2) predicting super-part poses is easier due to fewer superparts, we propose a part-whole-hierarchy message passing network for efficient 3D part assembly. We first introduce super-parts by grouping geometrically similar parts without any semantic labels. Then we employ a part-whole hierarchical encoder, wherein a super-part encoder predicts latent super-part poses based on input parts. Subsequently, we transform the point cloud using the latent poses, feeding it to the part encoder for aggregating super-part information and reasoning about part relationships to predict all part poses. In training, only ground-truth part poses are required. During inference, the predicted latent poses of super-parts enhance interpretability. Experimental results on the PartNet dataset show that our method achieves state-of-the-art performance in part and connectivity accuracy and enables an interpretable hierarchical part assembly.
♻ ☆ InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields CVPR 2024
We propose InNeRF360, an automatic system that accurately removes text-specified objects from 360-degree Neural Radiance Fields (NeRF). The challenge is to effectively remove objects while inpainting perceptually consistent content for the missing regions, which is particularly demanding for existing NeRF models due to their implicit volumetric representation. Moreover, unbounded scenes are more prone to floater artifacts in the inpainted region than frontal-facing scenes, as the change of object appearance and background across views is more sensitive to inaccurate segmentations and inconsistent inpainting. With a trained NeRF and a text description, our method efficiently removes specified objects and inpaints visually consistent content without artifacts. We apply depth-space warping to enforce consistency across multiview text-encoded segmentations, and then refine the inpainted NeRF model using perceptual priors and 3D diffusion-based geometric priors to ensure visual plausibility. Through extensive experiments in segmentation and inpainting on 360-degree and frontal-facing NeRFs, we show that our approach is effective and enhances NeRF's editability. Project page: https://ivrl.github.io/InNeRF360.
comment: CVPR 2024
♻ ☆ Passive Non-Line-of-Sight Imaging with Light Transport Modulation
Passive non-line-of-sight (NLOS) imaging has witnessed rapid development in recent years, due to its ability to image objects that are out of sight. The light transport condition plays an important role in this task since changing the conditions will lead to different imaging models. Existing learning-based NLOS methods usually train independent models for different light transport conditions, which is computationally inefficient and impairs the practicality of the models. In this work, we propose NLOS-LTM, a novel passive NLOS imaging method that effectively handles multiple light transport conditions with a single network. We achieve this by inferring a latent light transport representation from the projection image and using this representation to modulate the network that reconstructs the hidden image from the projection image. We train a light transport encoder together with a vector quantizer to obtain the light transport representation. To further regulate this representation, we jointly learn both the reconstruction network and the reprojection network during training. A set of light transport modulation blocks is used to modulate the two jointly trained networks in a multi-scale way. Extensive experiments on a large-scale passive NLOS dataset demonstrate the superiority of the proposed method. The code is available at https://github.com/JerryOctopus/NLOS-LTM.
♻ ☆ ViT-Lens: Towards Omni-modal Representations CVPR2024
Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project any-modal signals to an intermediate embedding space, which are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified solution for representation learning of increasing modalities with two appealing advantages: (i) Unlocking the great potential of pretrained ViTs to novel modalities effectively with efficient data regime; (ii) Enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio, tactile and EEG, and set new state-of-the-art results across various understanding tasks, such as zero-shot classification. By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens.
comment: This work is a follow-up of arXiv:2308.10185. Accepted to CVPR2024
♻ ☆ Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification CVPR 2024
Visible-Infrared Person Re-identification (VI-ReID) is a challenging cross-modal pedestrian retrieval task, due to significant intra-class variations and cross-modal discrepancies among different cameras. Existing works mainly focus on embedding images of different modalities into a unified space to mine modality-shared features. They only seek distinctive information within these shared features, while ignoring the identity-aware useful information that is implicit in the modality-specific features. To address this issue, we propose a novel Implicit Discriminative Knowledge Learning (IDKL) network to uncover and leverage the implicit discriminative information contained within the modality-specific. First, we extract modality-specific and modality-shared features using a novel dual-stream network. Then, the modality-specific features undergo purification to reduce their modality style discrepancies while preserving identity-aware discriminative knowledge. Subsequently, this kind of implicit knowledge is distilled into the modality-shared feature to enhance its distinctiveness. Finally, an alignment loss is proposed to minimize modality discrepancy on enhanced modality-shared features. Extensive experiments on multiple public datasets demonstrate the superiority of IDKL network over the state-of-the-art methods. Code is available at https://github.com/1KK077/IDKL.
comment: CVPR 2024
♻ ☆ In Search of a Data Transformation That Accelerates Neural Field Training CVPR 2024
Neural field is an emerging paradigm in data representation that trains a neural network to approximate the given signal. A key obstacle that prevents its widespread adoption is the encoding speed-generating neural fields requires an overfitting of a neural network, which can take a significant number of SGD steps to reach the desired fidelity level. In this paper, we delve into the impacts of data transformations on the speed of neural field training, specifically focusing on how permuting pixel locations affect the convergence speed of SGD. Counterintuitively, we find that randomly permuting the pixel locations can considerably accelerate the training. To explain this phenomenon, we examine the neural field training through the lens of PSNR curves, loss landscapes, and error patterns. Our analyses suggest that the random pixel permutations remove the easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder capturing fine details of the signal.
comment: CVPR 2024
♻ ☆ AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation CVPR 2024
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. Demo page is available on https://choijeongsoo.github.io/av2av.
comment: CVPR 2024. Code & Demo: https://choijeongsoo.github.io/av2av
♻ ☆ SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation
Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action ?", while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). In our experiments, that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/.
comment: Teaser Fixed
♻ ☆ Powerful Lossy Compression for Noisy Images ICME 2024
Image compression and denoising represent fundamental challenges in image processing with many real-world applications. To address practical demands, current solutions can be categorized into two main strategies: 1) sequential method; and 2) joint method. However, sequential methods have the disadvantage of error accumulation as there is information loss between multiple individual models. Recently, the academic community began to make some attempts to tackle this problem through end-to-end joint methods. Most of them ignore that different regions of noisy images have different characteristics. To solve these problems, in this paper, our proposed signal-to-noise ratio~(SNR) aware joint solution exploits local and non-local features for image compression and denoising simultaneously. We design an end-to-end trainable network, which includes the main encoder branch, the guidance branch, and the signal-to-noise ratio~(SNR) aware branch. We conducted extensive experiments on both synthetic and real-world datasets, demonstrating that our joint solution outperforms existing state-of-the-art methods.
comment: Accepted by ICME 2024
♻ ☆ ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
Though the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is limited to large-scale data, which is expensive or even inapplicable for rare modalities. In this paper, we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a pre-defined space. Specifically, the modality-specific lens is tuned to project multimodal signals to the shared embedding space, which are then processed by a strong ViT that carries pre-trained image knowledge. The encoded multimodal representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning of increasing modalities with two appealing benefits: (i) Exploiting the pretrained ViT across tasks and domains effectively with efficient data regime; (ii) Emergent downstream capabilities of novel modalities are demonstrated due to the modality alignment space. We evaluate ViT-Lens in the context of 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over previous state-of-the-art, showing 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question-answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release the results of ViT-Lens on more modalities in the near future.
comment: 19 pages, 4 figures and 9 tables
♻ ☆ TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling
Generating creative combinatorial objects from two seemingly unrelated object texts is a challenging task in text-to-image synthesis, often hindered by a focus on emulating existing data distributions. In this paper, we develop a straightforward yet highly effective method, called \textbf{balance swap-sampling}. First, we propose a swapping mechanism that generates a novel combinatorial object image set by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing CLIP distances between the new images and their original generations, increasing the likelihood of accepting the high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Extensive experiments demonstrate that our approach outperforms recent SOTA T2I methods. Surprisingly, our results even rival those of human artists, such as frog-broccoli.
comment: Project page: https://tp2o.github.io/anon/
♻ ☆ Segment and Caption Anything CVPR 24
We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM presents strong generalizability to segment anything while is short for semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically in the order of tens of millions), it costs less computation, less memory usage, and less communication bandwidth, resulting in both fast and scalable training. To address the scarcity problem of regional caption data, we propose to first pre-train our model on objection detection and segmentation tasks. We call this step weak supervision pretraining since the pre-training data only contains category names instead of full-sentence descriptions. The weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on exploring efficient ways to augment SAM with regional semantics. The project page, along with the associated code, can be accessed via https://xk-huang.github.io/segment-caption-anything/.
comment: The project page, along with the associated code, can be accessed via https://xk-huang.github.io/segment-caption-anything/; Update author information; Accepted by CVPR 24
♻ ☆ TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification
The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually face the problem of coarse alignment, e.g., the vision encoder struggles in localizing an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs. Concretely, given an image and its paired text, we manage to parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. It is noteworthy that the parsing pipeline is fully automatic and thus enjoys good scalability. With these parsed semantics as supervision signals, we can complement the commonly used image-text contrastive loss with the multi-tag classification loss. Extensive experimental results on a broad suite of semantic segmentation datasets substantiate the average 5.2\% improvement of our framework over existing alternatives. Furthermore, the visualization results indicate that attribute supervision makes vision-language models accurately localize attribute-specified objects. Project page can be found at https://qinying-liu.github.io/Tag-Align.
♻ ☆ SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. It incorporates appearance, geometry, and semantic features through multi-channel optimization, addressing the oversmoothing limitations of neural implicit SLAM systems in high-quality rendering, scene understanding, and object-level geometry. We introduce a unique semantic feature loss that effectively compensates for the shortcomings of traditional depth and color losses in object optimization. Through a semantic-guided keyframe selection strategy, we prevent erroneous reconstructions caused by cumulative errors. Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, precise semantic segmentation, and object-level geometric accuracy, while ensuring real-time rendering capabilities.
♻ ☆ ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation
This work introduces ArtAdapter, a transformative text-to-image (T2I) style transfer framework that transcends traditional limitations of color, brushstrokes, and object shape, capturing high-level style elements such as composition and distinctive artistic expression. The integration of a multi-level style encoder with our proposed explicit adaptation mechanism enables ArtAdapter to achieve unprecedented fidelity in style transfer, ensuring close alignment with textual descriptions. Additionally, the incorporation of an Auxiliary Content Adapter (ACA) effectively separates content from style, alleviating the borrowing of content from style references. Moreover, our novel fast finetuning approach could further enhance zero-shot style representation while mitigating the risk of overfitting. Comprehensive evaluations confirm that ArtAdapter surpasses current state-of-the-art methods.
♻ ☆ Clean-image Backdoor Attacks
To gather a significant quantity of annotated training data for high-performance image classification models, numerous companies opt to enlist third-party providers to label their unlabeled data. This practice is widely regarded as secure, even in cases where some annotated errors occur, as the impact of these minor inaccuracies on the final performance of the models is negligible and existing backdoor attacks require attacker's ability to poison the training images. Nevertheless, in this paper, we propose clean-image backdoor attacks which uncover that backdoors can still be injected via a fraction of incorrect labels without modifying the training images. Specifically, in our attacks, the attacker first seeks a trigger feature to divide the training images into two parts: those with the feature and those without it. Subsequently, the attacker falsifies the labels of the former part to a backdoor class. The backdoor will be finally implanted into the target model after it is trained on the poisoned data. During the inference phase, the attacker can activate the backdoor in two ways: slightly modifying the input image to obtain the trigger feature, or taking an image that naturally has the trigger feature as input. We conduct extensive experiments to demonstrate the effectiveness and practicality of our attacks. According to the experimental results, we conclude that our attacks seriously jeopardize the fairness and robustness of image classification models, and it is necessary to be vigilant about the incorrect labels in outsourced labeling.
♻ ☆ Transferring Relative Monocular Depth to Surgical Vision with Temporal Consistency
Relative monocular depth, inferring depth up to shift and scale from a single image, is an active research topic. Recent deep learning models, trained on large and varied meta-datasets, now provide excellent performance in the domain of natural images. However, few datasets exist which provide ground truth depth for endoscopic images, making training such models from scratch unfeasible. This work investigates the transfer of these models into the surgical domain, and presents an effective and simple way to improve on standard supervision through the use of temporal consistency self-supervision. We show temporal consistency significantly improves supervised training alone when transferring to the low-data regime of endoscopy, and outperforms the prevalent self-supervision technique for this task. In addition we show our method drastically outperforms the state-of-the-art method from within the domain of endoscopy. We also release our code, model and ensembled meta-dataset, Meta-MED, establishing a strong benchmark for future work.
♻ ☆ Towards Source-free Domain Adaptive Semantic Segmentation via Importance-aware and Prototype-contrast Learning
Domain adaptive semantic segmentation enables robust pixel-wise understanding in real-world driving scenes. Source-free domain adaptation, as a more practical technique, addresses the concerns of data privacy and storage limitations in typical unsupervised domain adaptation methods, making it especially relevant in the context of intelligent vehicles. It utilizes a well-trained source model and unlabeled target data to achieve adaptation in the target domain. However, in the absence of source data and target labels, current solutions cannot sufficiently reduce the impact of domain shift and fully leverage the information from the target data. In this paper, we propose an end-to-end source-free domain adaptation semantic segmentation method via Importance-Aware and Prototype-Contrast (IAPC) learning. The proposed IAPC framework effectively extracts domain-invariant knowledge from the well-trained source model and learns domain-specific knowledge from the unlabeled target domain. Specifically, considering the problem of domain shift in the prediction of the target domain by the source model, we put forward an importance-aware mechanism for the biased target prediction probability distribution to extract domain-invariant knowledge from the source model. We further introduce a prototype-contrast strategy, which includes a prototype-symmetric cross-entropy loss and a prototype-enhanced cross-entropy loss, to learn target intra-domain knowledge without relying on labels. A comprehensive variety of experiments on two domain adaptive semantic segmentation benchmarks demonstrates that the proposed end-to-end IAPC solution outperforms existing state-of-the-art methods. The source code is publicly available at https://github.com/yihong-97/Source-free-IAPC.
comment: Accepted to IEEE Transactions on Intelligent Vehicles (T-IV). The source code is publicly available at https://github.com/yihong-97/Source-free-IAPC
♻ ☆ SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching CVPR 2024
In this paper, we address the challenge of matching semantically similar keypoints across image pairs. Existing research indicates that the intermediate output of the UNet within the Stable Diffusion (SD) can serve as robust image feature maps for such a matching task. We demonstrate that by employing a basic prompt tuning technique, the inherent potential of Stable Diffusion can be harnessed, resulting in a significant enhancement in accuracy over previous approaches. We further introduce a novel conditional prompting module that conditions the prompt on the local details of the input image pairs, leading to a further improvement in performance. We designate our approach as SD4Match, short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets new benchmarks in accuracy across all these datasets. Particularly, SD4Match outperforms the previous state-of-the-art by a margin of 12 percentage points on the challenging SPair-71k dataset.
comment: Accepted to CVPR 2024. Project website: https://sd4match.active.vision/
♻ ☆ ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes
Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. The majority of robustness evaluation methods have introduced synthetic datasets to induce changes to object characteristics (viewpoints, scale, color) or utilized image transformation techniques (adversarial changes, common corruptions) on real images to simulate shifts in distributions. Recent works have explored leveraging large language models and diffusion models to generate changes in the background. However, these methods either lack in offering control over the changes to be made or distort the object semantics, making them unsuitable for the task. Our method, on the other hand, can induce diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embedding of text-to-image models. We produce various versions of standard vision datasets (ImageNet, COCO), incorporating either diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct extensive experiment to analyze the robustness of vision-based models against object-to-background context variations across diverse tasks. Code https://github.com/Muhammad-Huzaifaa/ObjectCompose.git
♻ ☆ Motion Generation from Fine-grained Textual Descriptions
The task of text2motion is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., "A man squats.", fine-grained descriptions specifying movements of relevant body parts are barely explored. Models trained with coarse-grained texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure to generate motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with step-by-step instructions with pseudo-code compulsory checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions, by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at https://github.com/KunhangL/finemotiondiffuse.
Information Retrieval
☆ Search and Society: Reimagining Information Access for Radical Futures
Information retrieval (IR) technologies and research are undergoing transformative changes. It is our perspective that the community should accept this opportunity to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others. In this perspective paper, we motivate why the community must consider this radical shift in how we do research and what we work on, and sketch a path forward towards this transformation.
☆ MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation SIGIR
Digital news platforms use news recommenders as the main instrument to cater to the individual information needs of readers. Despite an increasingly language-diverse online community, in which many Internet users consume news in multiple languages, the majority of news recommendation focuses on major, resource-rich languages, and English in particular. Moreover, nearly all news recommendation efforts assume monolingual news consumption, whereas more and more users tend to consume information in at least two languages. Accordingly, the existing body of work on news recommendation suffers from a lack of publicly available multilingual benchmarks that would catalyze development of news recommenders effective in multilingual settings and for low-resource languages. Aiming to fill this gap, we introduce xMIND, an open, multilingual news recommendation dataset derived from the English MIND dataset using machine translation, covering a set of 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. Using xMIND, we systematically benchmark several state-of-the-art content-based neural news recommenders (NNRs) in both zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer scenarios, considering both monolingual and bilingual news consumption patterns. Our findings reveal that (i) current NNRs, even when based on a multilingual language model, suffer from substantial performance losses under ZS-XLT and that (ii) inclusion of target-language data in FS-XLT training has limited benefits, particularly when combined with a bilingual news consumption. Our findings thus warrant a broader research effort in multilingual and cross-lingual news recommendation. The xMIND dataset is available at https://github.com/andreeaiana/xMIND.
comment: Accepted at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)
☆ ArabicaQA: A Comprehensive Dataset for Arabic Question Answering SIGIR 2024
In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels of open-domain questions marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs in Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly accessible for further research https://github.com/DataScienceUIBK/ArabicaQA.
comment: Accepted at SIGIR 2024
☆ CaseLink: Inductive Graph Learning for Legal Case Retrieval
In case law, the precedents are the relevant cases that are used to support the decisions made by the judges and the opinions of lawyers towards a given case. This relevance is referred to as the case-to-case reference relation. To efficiently find relevant cases from a large case pool, retrieval tools are widely used by legal practitioners. Existing legal case retrieval models mainly work by comparing the text representations of individual cases. Although they obtain a decent retrieval accuracy, the intrinsic case connectivity relationships among cases have not been well exploited for case encoding, therefore limiting the further improvement of retrieval performance. In a case pool, there are three types of case connectivity relationships: the case reference relationship, the case semantic relationship, and the case legal charge relationship. Due to the inductive manner in the task of legal case retrieval, using case reference as input is not applicable for testing. Thus, in this paper, a CaseLink model based on inductive graph learning is proposed to utilise the intrinsic case connectivity for legal case retrieval, a novel Global Case Graph is incorporated to represent both the case semantic relationship and the case legal charge relationship. A novel contrastive objective with a regularisation on the degree of case nodes is proposed to leverage the information carried by the case reference relationship to optimise the model. Extensive experiments have been conducted on two benchmark datasets, which demonstrate the state-of-the-art performance of CaseLink. The code has been released on https://github.com/yanran-tang/CaseLink.
☆ TWOLAR: a TWO-step LLM-Augmented distillation method for passage Reranking
In this paper, we present TWOLAR: a two-stage pipeline for passage reranking based on the distillation of knowledge from Large Language Models (LLM). TWOLAR introduces a new scoring strategy and a distillation process consisting in the creation of a novel and diverse training dataset. The dataset consists of 20K queries, each associated with a set of documents retrieved via four distinct retrieval methods to ensure diversity, and then reranked by exploiting the zero-shot reranking capabilities of an LLM. Our ablation studies demonstrate the contribution of each new component we introduced. Our experimental results show that TWOLAR significantly enhances the document reranking ability of the underlying model, matching and in some cases even outperforming state-of-the-art models with three orders of magnitude more parameters on the TREC-DL test sets and the zero-shot evaluation benchmark BEIR. To facilitate future work we release our data set, finetuned models, and code.
☆ All-in-One: Heterogeneous Interaction Modeling for Cold-Start Rating Prediction
Cold-start rating prediction is a fundamental problem in recommender systems that has been extensively studied. Many methods have been proposed that exploit explicit relations among existing data, such as collaborative filtering, social recommendations and heterogeneous information network, to alleviate the data insufficiency issue for cold-start users and items. However, the explicit relations constructed based on data between different roles may be unreliable and irrelevant, which limits the performance ceiling of the specific recommendation task. Motivated by this, in this paper, we propose a flexible framework dubbed heterogeneous interaction rating network (HIRE). HIRE dose not solely rely on the pre-defined interaction pattern or the manually constructed heterogeneous information network. Instead, we devise a Heterogeneous Interaction Module (HIM) to jointly model the heterogeneous interactions and directly infer the important interactions via the observed data. In the experiments, we evaluate our model under three cold-start settings on three real-world datasets. The experimental results show that HIRE outperforms other baselines by a large margin. Furthermore, we visualize the inferred interactions of HIRE to confirm the contribution of our model.
comment: 14 pages, 9 figures
☆ EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention SIGIR'24
To capture user preference, transformer models have been widely applied to model sequential user behavior data. The core of transformer architecture lies in the self-attention mechanism, which computes the pairwise attention scores in a sequence. Due to the permutation-equivariant nature, positional encoding is used to enhance the attention between token representations. In this setting, the pairwise attention scores can be derived by both semantic difference and positional difference. However, prior studies often model the two kinds of difference measurements in different ways, which potentially limits the expressive capacity of sequence modeling. To address this issue, this paper proposes a novel transformer variant with complex vector attention, named EulerFormer, which provides a unified theoretical framework to formulate both semantic difference and positional difference. The EulerFormer involves two key technical improvements. First, it employs a new transformation function for efficiently transforming the sequence tokens into polar-form complex vectors using Euler's formula, enabling the unified modeling of both semantic and positional information in a complex rotation form.Secondly, it develops a differential rotation mechanism, where the semantic rotation angles can be controlled by an adaptation function, enabling the adaptive integration of the semantic and positional information according to the semantic contexts.Furthermore, a phase contrastive learning task is proposed to improve the anisotropy of contextual representations in EulerFormer. Our theoretical framework possesses a high degree of completeness and generality. It is more robust to semantic variations and possesses moresuperior theoretical properties in principle. Extensive experiments conducted on four public datasets demonstrate the effectiveness and efficiency of our approach.
comment: Accepted for publication in SIGIR'24
☆ Large Language Models Enhanced Collaborative Filtering
Recent advancements in Large Language Models (LLMs) have attracted considerable interest among researchers to leverage these models to enhance Recommender Systems (RSs). Existing work predominantly utilizes LLMs to generate knowledge-rich texts or utilizes LLM-derived embeddings as features to improve RSs. Al- though the extensive world knowledge embedded in LLMs generally benefits RSs, the application can only take limited number of users and items as inputs, without adequately exploiting collaborative filtering information. Considering its crucial role in RSs, one key challenge in enhancing RSs with LLMs lies in providing better collaborative filtering information through LLMs. In this paper, drawing inspiration from the in-context learning and chain of thought reasoning in LLMs, we propose the Large Language Models enhanced Collaborative Filtering (LLM-CF) framework, which distils the world knowledge and reasoning capabilities of LLMs into collaborative filtering. We also explored a concise and efficient instruction-tuning method, which improves the recommendation capabilities of LLMs while preserving their general functionalities (e.g., not decreasing on the LLM benchmark). Comprehensive experiments on three real-world datasets demonstrate that LLM-CF significantly enhances several backbone recommendation models and consistently outperforms competitive baselines, showcasing its effectiveness in distilling the world knowledge and reasoning capabilities of LLM into collaborative filtering.
comment: 11 pages
☆ S+t-SNE - Bringing dimensionality reduction to data streams
We present S+t-SNE, an adaptation of the t-SNE algorithm designed to handle infinite data streams. The core idea behind S+t-SNE is to update the t-SNE embedding incrementally as new data arrives, ensuring scalability and adaptability to handle streaming scenarios. By selecting the most important points at each step, the algorithm ensures scalability while keeping informative visualisations. Employing a blind method for drift management adjusts the embedding space, facilitating continuous visualisation of evolving data dynamics. Our experimental evaluations demonstrate the effectiveness and efficiency of S+t-SNE. The results highlight its ability to capture patterns in a streaming scenario. We hope our approach offers researchers and practitioners a real-time tool for understanding and interpreting high-dimensional data.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. We will soon add a link to the final version of this contribution that underwent peer-review and post-acceptance improvements and was presented at IDA2024 (https://ida2024.org/)
☆ Retentive Decision Transformer with Adaptive Masking for Reinforcement Learning based Recommendation Systems
Reinforcement Learning-based Recommender Systems (RLRS) have shown promise across a spectrum of applications, from e-commerce platforms to streaming services. Yet, they grapple with challenges, notably in crafting reward functions and harnessing large pre-existing datasets within the RL framework. Recent advancements in offline RLRS provide a solution for how to address these two challenges. However, existing methods mainly rely on the transformer architecture, which, as sequence lengths increase, can introduce challenges associated with computational resources and training costs. Additionally, the prevalent methods employ fixed-length input trajectories, restricting their capacity to capture evolving user preferences. In this study, we introduce a new offline RLRS method to deal with the above problems. We reinterpret the RLRS challenge by modeling sequential decision-making as an inference task, leveraging adaptive masking configurations. This adaptive approach selectively masks input tokens, transforming the recommendation task into an inference challenge based on varying token subsets, thereby enhancing the agent's ability to infer across diverse trajectory lengths. Furthermore, we incorporate a multi-scale segmented retention mechanism that facilitates efficient modeling of long sequences, significantly enhancing computational efficiency. Our experimental analysis, conducted on both online simulator and offline datasets, clearly demonstrates the advantages of our proposed method.
☆ END4Rec: Efficient Noise-Decoupling for Multi-Behavior Sequential Recommendation
In recommendation systems, users frequently engage in multiple types of behaviors, such as clicking, adding to a cart, and purchasing. However, with diversified behavior data, user behavior sequences will become very long in the short term, which brings challenges to the efficiency of the sequence recommendation model. Meanwhile, some behavior data will also bring inevitable noise to the modeling of user interests. To address the aforementioned issues, firstly, we develop the Efficient Behavior Sequence Miner (EBM) that efficiently captures intricate patterns in user behavior while maintaining low time complexity and parameter count. Secondly, we design hard and soft denoising modules for different noise types and fully explore the relationship between behaviors and noise. Finally, we introduce a contrastive loss function along with a guided training strategy to compare the valid information in the data with the noisy signal, and seamlessly integrate the two denoising processes to achieve a high degree of decoupling of the noisy signal. Sufficient experiments on real-world datasets demonstrate the effectiveness and efficiency of our approach in dealing with multi-behavior sequential recommendation.
☆ Document Set Expansion with Positive-Unlabelled Learning Using Intractable Density Estimation LREC
The Document Set Expansion (DSE) task involves identifying relevant documents from large collections based on a limited set of example documents. Previous research has highlighted Positive and Unlabeled (PU) learning as a promising approach for this task. However, most PU methods rely on the unrealistic assumption of knowing the class prior for positive samples in the collection. To address this limitation, this paper introduces a novel PU learning framework that utilizes intractable density estimation models. Experiments conducted on PubMed and Covid datasets in a transductive setting showcase the effectiveness of the proposed method for DSE. Code is available from https://github.com/Beautifuldog01/Document-set-expansion-puDE.
comment: Accepted at LREC-COLING 2024. arXiv admin note: text overlap with arXiv:2401.11145
☆ Touch the Core: Exploring Task Dependence Among Hybrid Targets for Recommendation
As user behaviors become complicated on business platforms, online recommendations focus more on how to touch the core conversions, which are highly related to the interests of platforms. These core conversions are usually continuous targets, such as \textit{watch time}, \textit{revenue}, and so on, whose predictions can be enhanced by previous discrete conversion actions. Therefore, multi-task learning (MTL) can be adopted as the paradigm to learn these hybrid targets. However, existing works mainly emphasize investigating the sequential dependence among discrete conversion actions, which neglects the complexity of dependence between discrete conversions and the final continuous conversion. Moreover, simultaneously optimizing hybrid tasks with stronger task dependence will suffer from volatile issues where the core regression task might have a larger influence on other tasks. In this paper, we study the MTL problem with hybrid targets for the first time and propose the model named Hybrid Targets Learning Network (HTLNet) to explore task dependence and enhance optimization. Specifically, we introduce label embedding for each task to explicitly transfer the label information among these tasks, which can effectively explore logical task dependence. We also further design the gradient adjustment regime between the final regression task and other classification tasks to enhance the optimization. Extensive experiments on two offline public datasets and one real-world industrial dataset are conducted to validate the effectiveness of HTLNet. Moreover, online A/B tests on the financial recommender system also show our model has superior improvement.
☆ Masked Multi-Domain Network: Multi-Type and Multi-Scenario Conversion Rate Prediction with a Single Model CIKM 2023
In real-world advertising systems, conversions have different types in nature and ads can be shown in different display scenarios, both of which highly impact the actual conversion rate (CVR). This results in the multi-type and multi-scenario CVR prediction problem. A desired model for this problem should satisfy the following requirements: 1) Accuracy: the model should achieve fine-grained accuracy with respect to any conversion type in any display scenario. 2) Scalability: the model parameter size should be affordable. 3) Convenience: the model should not require a large amount of effort in data partitioning, subset processing and separate storage. Existing approaches cannot simultaneously satisfy these requirements. For example, building a separate model for each (conversion type, display scenario) pair is neither scalable nor convenient. Building a unified model trained on all the data with conversion type and display scenario included as two features is not accurate enough. In this paper, we propose the Masked Multi-domain Network (MMN) to solve this problem. To achieve the accuracy requirement, we model domain-specific parameters and propose a dynamically weighted loss to account for the loss scale imbalance issue within each mini-batch. To achieve the scalability requirement, we propose a parameter sharing and composition strategy to reduce model parameters from a product space to a sum space. To achieve the convenience requirement, we propose an auto-masking strategy which can take mixed data from all the domains as input. It avoids the overhead caused by data partitioning, individual processing and separate storage. Both offline and online experimental results validate the superiority of MMN for multi-type and multi-scenario CVR prediction. MMN is now the serving model for real-time CVR prediction in UC Toutiao.
comment: CIKM 2023 (larger figures)
☆ MA4DIV: Multi-Agent Reinforcement Learning for Search Result Diversification
The objective of search result diversification (SRD) is to ensure that selected documents cover as many different subtopics as possible. Existing methods primarily utilize a paradigm of "greedy selection", i.e., selecting one document with the highest diversity score at a time. These approaches tend to be inefficient and are easily trapped in a suboptimal state. In addition, some other methods aim to approximately optimize the diversity metric, such as $\alpha$-NDCG, but the results still remain suboptimal. To address these challenges, we introduce Multi-Agent reinforcement learning (MARL) for search result DIVersity, which called MA4DIV. In this approach, each document is an agent and the search result diversification is modeled as a cooperative task among multiple agents. This approach allows for directly optimizing the diversity metrics, such as $\alpha$-NDCG, while achieving high training efficiency. We conducted preliminary experiments on public TREC datasets to demonstrate the effectiveness and potential of MA4DIV. Considering the limited number of queries in public TREC datasets, we construct a large-scale dataset from industry sources and show that MA4DIV achieves substantial improvements in both effectiveness and efficiency than existing baselines on a industrial scale dataset.
☆ AFDGCF: Adaptive Feature De-correlation Graph Collaborative Filtering for Recommendations SIGIR2024
Collaborative filtering methods based on graph neural networks (GNNs) have witnessed significant success in recommender systems (RS), capitalizing on their ability to capture collaborative signals within intricate user-item relationships via message-passing mechanisms. However, these GNN-based RS inadvertently introduce excess linear correlation between user and item embeddings, contradicting the goal of providing personalized recommendations. While existing research predominantly ascribes this flaw to the over-smoothing problem, this paper underscores the critical, often overlooked role of the over-correlation issue in diminishing the effectiveness of GNN representations and subsequent recommendation performance. Up to now, the over-correlation issue remains unexplored in RS. Meanwhile, how to mitigate the impact of over-correlation while preserving collaborative filtering signals is a significant challenge. To this end, this paper aims to address the aforementioned gap by undertaking a comprehensive study of the over-correlation issue in graph collaborative filtering models. Firstly, we present empirical evidence to demonstrate the widespread prevalence of over-correlation in these models. Subsequently, we dive into a theoretical analysis which establishes a pivotal connection between the over-correlation and over-smoothing issues. Leveraging these insights, we introduce the Adaptive Feature De-correlation Graph Collaborative Filtering (AFDGCF) framework, which dynamically applies correlation penalties to the feature dimensions of the representation matrix, effectively alleviating both over-correlation and over-smoothing issues. The efficacy of the proposed framework is corroborated through extensive experiments conducted with four representative graph collaborative filtering models across four publicly available datasets.
comment: Accepted by SIGIR2024
☆ Multi-Domain Recommendation to Attract Users via Domain Preference Modeling AAAI'24
Recently, web platforms have been operating various service domains simultaneously. Targeting a platform that operates multiple service domains, we introduce a new task, Multi-Domain Recommendation to Attract Users (MDRAU), which recommends items from multiple ``unseen'' domains with which each user has not interacted yet, by using knowledge from the user's ``seen'' domains. In this paper, we point out two challenges of MDRAU task. First, there are numerous possible combinations of mappings from seen to unseen domains because users have usually interacted with a different subset of service domains. Second, a user might have different preferences for each of the target unseen domains, which requires that recommendations reflect the user's preferences on domains as well as items. To tackle these challenges, we propose DRIP framework that models users' preferences at two levels (i.e., domain and item) and learns various seen-unseen domain mappings in a unified way with masked domain modeling. Our extensive experiments demonstrate the effectiveness of DRIP in MDRAU task and its ability to capture users' domain-level preferences.
comment: Accepted to AAAI'24
☆ An Empirical Study of Training ID-Agnostic Multi-modal Sequential Recommenders
Sequential Recommendation (SR) aims to predict future user-item interactions based on historical interactions. While many SR approaches concentrate on user IDs and item IDs, the human perception of the world through multi-modal signals, like text and images, has inspired researchers to delve into constructing SR from multi-modal information without using IDs. However, the complexity of multi-modal learning manifests in diverse feature extractors, fusion methods, and pre-trained models. Consequently, designing a simple and universal \textbf{M}ulti-\textbf{M}odal \textbf{S}equential \textbf{R}ecommendation (\textbf{MMSR}) framework remains a formidable challenge. We systematically summarize the existing multi-modal related SR methods and distill the essence into four core components: visual encoder, text encoder, multimodal fusion module, and sequential architecture. Along these dimensions, we dissect the model designs, and answer the following sub-questions: First, we explore how to construct MMSR from scratch, ensuring its performance either on par with or exceeds existing SR methods without complex techniques. Second, we examine if MMSR can benefit from existing multi-modal pre-training paradigms. Third, we assess MMSR's capability in tackling common challenges like cold start and domain transferring. Our experiment results across four real-world recommendation scenarios demonstrate the great potential ID-agnostic multi-modal sequential recommendation. Our framework can be found at: https://github.com/MMSR23/MMSR.
comment: An Empirical Study of Training ID-Agnostic Multi-modal Sequential Recommenders
☆ Cognitively Biased Users Interacting with Algorithmically Biased Results in Whole-Session Search on Controversial Topics
When interacting with information retrieval (IR) systems, users, affected by confirmation biases, tend to select search results that confirm their existing beliefs on socially significant contentious issues. To understand the judgments and attitude changes of users searching online, our study examined how cognitively biased users interact with algorithmically biased search engine result pages (SERPs). We designed three-query search sessions on debated topics under various bias conditions. We recruited 1,321 crowdsourcing participants and explored their attitude changes, search interactions, and the effects of confirmation bias. Three key findings emerged: 1) most attitude changes occur in the initial query of a search session; 2) confirmation bias and result presentation on SERPs affect search behaviors in the current query and perceived familiarity with clicked results in subsequent queries. The bias position also affect attitude changes of users with lower perceived openness to conflicting opinions; 3) Interactions in the first query and and dwell time throughout the session are associated with users' attitude changes in different forms. Our study goes beyond traditional simulation-based evaluation settings and simulated rational users, sheds light on the mixed effects of human biases and algorithmic biases in controversial information retrieval tasks, and can inform the design of bias-aware user models, human-centered bias mitigation techniques, and socially responsible intelligent IR systems.
♻ ☆ A Decade of Scholarly Research on Open Knowledge Graphs LREC
The proliferation of open knowledge graphs has led to a surge in scholarly research on the topic over the past decade. This paper presents a bibliometric analysis of the scholarly literature on open knowledge graphs published between 2013 and 2023. The study aims to identify the trends, patterns, and impact of research in this field, as well as the key topics and research questions that have emerged. The work uses bibliometric techniques to analyze a sample of 4445 scholarly articles retrieved from Scopus. The findings reveal an ever-increasing number of publications on open knowledge graphs published every year, particularly in developed countries (+50 per year). These outputs are published in highly-referred scholarly journals and conferences. The study identifies three main research themes: (1) knowledge graph construction and enrichment, (2) evaluation and reuse, and (3) fusion of knowledge graphs into NLP systems. Within these themes, the study identifies specific tasks that have received considerable attention, including entity linking, knowledge graph embedding, and graph neural networks.
comment: Camera-ready edition for LREC-COLING 2024
♻ ☆ Unsupervised Large Language Model Alignment for Information Retrieval via Contrastive Feedback SIGIR24
Large language models (LLMs) have demonstrated remarkable capabilities across various research domains, including the field of Information Retrieval (IR). However, the responses generated by off-the-shelf LLMs tend to be generic, i.e., cannot capture the distinctiveness of each document with similar content. This limits the performance of LLMs in IR because finding and distinguishing relevant documents from substantial similar documents is a typical problem in many IR tasks. To address this issue, we propose an unsupervised alignment method, namely Reinforcement Learning from Contrastive Feedback (RLCF), empowering LLMs to generate both high-quality and context-specific responses. Our approach constructs unsupervised contrastive feedback signals based on similar document groups, and adopts a reward function, named group-wise reciprocal rank, to optimize LLMs within a standard Proximal Policy Optimization. We conduct extensive experiments to evaluate the effectiveness of RLCF on LLMs built with different languages and parameter sizes on multiple downstream IR applications. RLCF significantly outperforms existing alignment methods, and RLCF-optimized LLMs demonstrate considerable improvement in generating responses with distinctiveness.
comment: Accepted by SIGIR24
♻ ☆ Coarse-Tuning for Ad-hoc Document Retrieval Using Pre-trained Language Models LREC
Fine-tuning in information retrieval systems using pre-trained language models (PLM-based IR) requires learning query representations and query-document relations, in addition to downstream task-specific learning. This study introduces coarse-tuning as an intermediate learning stage that bridges pre-training and fine-tuning. By learning query representations and query-document relations in coarse-tuning, we aim to reduce the load of fine-tuning and improve the learning effect of downstream IR tasks. We propose Query-Document Pair Prediction (QDPP) for coarse-tuning, which predicts the appropriateness of query-document pairs. Evaluation experiments show that the proposed method significantly improves MRR and/or nDCG@5 in four ad-hoc document retrieval datasets. Furthermore, the results of the query prediction task suggested that coarse-tuning facilitated learning of query representation and query-document relations.
comment: Accepted at LREC-COLING 2024
♻ ☆ Optimizing Feature Set for Click-Through Rate Prediction WWW 2023
Click-through prediction (CTR) models transform features into latent vectors and enumerate possible feature interactions to improve performance based on the input feature set. Therefore, when selecting an optimal feature set, we should consider the influence of both feature and its interaction. However, most previous works focus on either feature field selection or only select feature interaction based on the fixed feature set to produce the feature set. The former restricts search space to the feature field, which is too coarse to determine subtle features. They also do not filter useless feature interactions, leading to higher computation costs and degraded model performance. The latter identifies useful feature interaction from all available features, resulting in many redundant features in the feature set. In this paper, we propose a novel method named OptFS to address these problems. To unify the selection of feature and its interaction, we decompose the selection of each feature interaction into the selection of two correlated features. Such a decomposition makes the model end-to-end trainable given various feature interaction operations. By adopting feature-level search space, we set a learnable gate to determine whether each feature should be within the feature set. Because of the large-scale search space, we develop a learning-by-continuation training scheme to learn such gates. Hence, OptFS generates the feature set only containing features which improve the final prediction results. Experimentally, we evaluate OptFS on three public datasets, demonstrating OptFS can optimize feature sets which enhance the model performance and further reduce both the storage and computational cost.
comment: Accepted by WWW 2023 Research Tracks
♻ ☆ Graph Signal Diffusion Model for Collaborative Filtering SIGIR 2024
Collaborative filtering is a critical technique in recommender systems. Among various methods, an increasingly popular paradigm is to reconstruct user-item interactions based on the historical observations. This can be viewed as a conditional generative task, where recently developed diffusion model demonstrates great potential. However, existing studies on diffusion models lack effective solutions for modeling implicit feedback data. Particularly, the isotropic nature of the standard diffusion process fails to account for the heterogeneous dependencies among items, leading to a misalignment with the graphical structure of the interaction space. Meanwhile, random noise destroying personalized information in interaction vectors, causing difficulty in reverse reconstruction. In this paper, we make novel adaptions of diffusion model and propose Graph Signal Diffusion Model for Collaborative Filtering (named GiffCF). To better represent the high-dimensional and sparse distribution of implicit feedback, we define a generalized form of denoising diffusion using heat equation on the item-item similarity graph. Our forward process smooths interaction signals with an advanced family of graph filters. Hence, instead of losing information, it involves item-item similarities as beneficial prior knowledge for recommendation. To reconstruct high-quality interactions, our reverse process iteratively refines and sharpens preference signals in a deterministic manner, where the update direction is conditioned on the user history and computed from a carefully designed two-stage denoiser. Finally, through extensive experiments, we show that GiffCF effectively leverages the advantages of both diffusion model and graph signal processing, and achieves state-of-the-art performance on three benchmark datasets.
comment: 11 pages, 8 figures, Accepted by SIGIR 2024
♻ ☆ Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding
We evaluated 20+ Transformer models for ranking of long documents (including recent LongP models trained with FlashAttention) and compared them with simple FirstP baselines (applying the same model to input truncated to the first 512 tokens). We used MS MARCO Documents v1 as a primary training set and evaluated models in the zero-shot scenario as well as after fine-tuning on other collections. In our initial experiments with standard collections we found that long-document models underperformed FirstP or outperformed it by at most 5% on average in terms of MRR or NDCG. We then conjectured that this was not due to models inability to process long context but rather due to a positional bias of relevant passages, which tended to be among the first 512 document tokens. We found evidence that this bias was, indeed, present in at least two test sets, which motivated us to create a new collection MS MARCO FarRelevant where the relevant passages were not present among the first 512 tokens. Unlike standard collections where we observed both little benefit from incorporating longer contexts and limited variability in model performance (within a few %), experiments on MS MARCO FarRelevant uncovered dramatic differences among models. FirstP models performed roughly at the random-baseline level in both zero-shot and fine-tuning scenarios. Simple aggregation models (e.g., MaxP) had good zero-shot accuracy but benefited little from fine-tuning. Most other models had poor zero-shot performance (sometimes at a random baseline level) but outstripped MaxP by as much 13-28\% after finetuning. Thus, positional bias not only diminishes benefits of processing longer document contexts but also leads to model overfitting to this bias and performing poorly in a zero-shot setting when a distribution of relevant passages changes substantially. We make our software and MS MARCO FarRelevant available.
♻ ☆ Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction COLING 2024
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction, i.e., prompts tend to introduce biases toward specific labels. Prompt bias presents a significant challenge in assessing the factual knowledge within PLMs. Therefore, this paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias. We show that: 1) all prompts in the experiments exhibit non-negligible bias, with gradient-based prompts like AutoPrompt and OptiPrompt displaying significantly higher levels of bias; 2) prompt bias can amplify benchmark accuracy unreasonably by overfitting the test datasets, especially on imbalanced datasets like LAMA. Based on these findings, we propose a representation-based approach to mitigate the prompt bias during inference time. Specifically, we first estimate the biased representation using prompt-only querying, and then remove it from the model's internal representations to generate the debiased representations, which are used to produce the final debiased outputs. Experiments across various prompts, PLMs, and benchmarks show that our approach can not only correct the overfitted performance caused by prompt bias, but also significantly improve the prompt retrieval capability (up to 10% absolute performance gain). These results indicate that our approach effectively alleviates prompt bias in knowledge evaluation, thereby enhancing the reliability of benchmark assessments. Hopefully, our plug-and-play approach can be a golden standard to strengthen PLMs toward reliable knowledge bases. Code and data are released in https://github.com/FelliYang/PromptBias.
comment: Accepted by COLING 2024
♻ ☆ Improving Sequential Recommendations via Bidirectional Temporal Data Augmentation with Pre-training
Sequential recommendation systems are integral to discerning temporal user preferences. Yet, the task of learning from abbreviated user interaction sequences poses a notable challenge. Data augmentation has been identified as a potent strategy to enhance the informational richness of these sequences. Traditional augmentation techniques, such as item randomization, may disrupt the inherent temporal dynamics. Although recent advancements in reverse chronological pseudo-item generation have shown promise, they can introduce temporal discrepancies when assessed in a natural chronological context. In response, we introduce a sophisticated approach, Bidirectional temporal data Augmentation with pre-training (BARec). Our approach leverages bidirectional temporal augmentation and knowledge-enhanced fine-tuning to synthesize authentic pseudo-prior items that \emph{retain user preferences and capture deeper item semantic correlations}, thus boosting the model's expressive power. Our comprehensive experimental analysis confirms the superiority of BARec across both short and elongated sequence contexts. Moreover, theoretical examination and visual representation of item embeddings offer further insight into the model's logical processes and interpretability. The source code for our study is available at \textcolor{blue}{\href{https://github.com/juyongjiang/BARec}{https://github.com/juyongjiang/BARec}}.
Data Structures and Algorithms
☆ Counting Stars is Constant-Degree Optimal For Detecting Any Planted Subgraph
We study the computational limits of the following general hypothesis testing problem. Let H=H_n be an \emph{arbitrary} undirected graph on n vertices. We study the detection task between a ``null'' Erd\H{o}s-R\'{e}nyi random graph G(n,p) and a ``planted'' random graph which is the union of G(n,p) together with a random copy of H=H_n. Our notion of planted model is a generalization of a plethora of recently studied models initiated with the study of the planted clique model (Jerrum 1992), which corresponds to the special case where H is a k-clique and p=1/2. Over the last decade, several papers have studied the power of low-degree polynomials for limited choices of H's in the above task. In this work, we adopt a unifying perspective and characterize the power of \emph{constant degree} polynomials for the detection task, when \emph{H=H_n is any arbitrary graph} and for \emph{any p=\Omega(1).} Perhaps surprisingly, we prove that the optimal constant degree polynomial is always given by simply \emph{counting stars} in the input random graph. As a direct corollary, we conclude that the class of constant-degree polynomials is only able to ``sense'' the degree distribution of the planted graph H, and no other graph theoretic property of it.
☆ How Private is DP-SGD?
We demonstrate a substantial gap between the privacy guarantees of the Adaptive Batch Linear Queries (ABLQ) mechanism under different types of batch sampling: (i) Shuffling, and (ii) Poisson subsampling; the typical analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) follows by interpreting it as a post-processing of ABLQ. While shuffling based DP-SGD is more commonly used in practical implementations, it is neither analytically nor numerically amenable to easy privacy analysis. On the other hand, Poisson subsampling based DP-SGD is challenging to scalably implement, but has a well-understood privacy analysis, with multiple open-source numerically tight privacy accountants available. This has led to a common practice of using shuffling based DP-SGD in practice, but using the privacy analysis for the corresponding Poisson subsampling version. Our result shows that there can be a substantial gap between the privacy analysis when using the two types of batch sampling, and thus advises caution in reporting privacy parameters for DP-SGD.
☆ Generalising the maximum independent set algorithm via Boolean networks
A simple greedy algorithm to find a maximal independent set (MIS) in a graph starts with the empty set and visits every vertex, adding it to the set if and only if none of its neighbours are already in the set. In this paper, we consider the generalisation of this MIS algorithm by letting it start with any set of vertices and we prove the hardness of many decision problems related to this generalisation. Our results are based on two main strategies. Firstly, we view the MIS algorithm as a sequential update of a Boolean network, which we refer to as the MIS network, according to a permutation of the vertex set. The set of fixed points of the MIS network corresponds to the set of MIS of the graph. Our generalisation then consists in starting from any configuration and following a sequential update given by a word of vertices. Secondly, we introduce the concept of a colony of a graph, that is a set of vertices that is dominated by an independent set. Deciding whether a set of vertices is a colony is NP-complete; decision problems related to the MIS algorithm will be reduced from the Colony problem. We first show that deciding whether a configuration can reach all maximal independent sets is coNP-complete. Second, we consider so-called fixing words, that allow to reach a MIS for any initial configuration, and fixing permutations, which we call permises; deciding whether a permutation is fixing is coNP-complete. Third, we show that deciding whether a graph has a permis is coNP-hard. Finally, we generalise the MIS algorithm to digraphs. The algorithm then uses the so-called kernel network, whose fixed points are the kernels of the digraph. Deciding whether the kernel network of a given digraph is fixable is coNP-hard, even for digraphs that have a kernel. Alternatively, we introduce two fixable Boolean networks whose sets of fixed points contain all kernels.
comment: Significantly extends arxiv:2307.05216
☆ Parameterized Analysis of Bribery in Challenge the Champ Tournaments
Challenge the champ tournaments are one of the simplest forms of competition, where a (initially selected) champ is repeatedly challenged by other players. If a player beats the champ, then that player is considered the new (current) champ. Each player in the competition challenges the current champ once in a fixed order. The champ of the last round is considered the winner of the tournament. We investigate a setting where players can be bribed to lower their winning probability against the initial champ. The goal is to maximize the probability of the initial champ winning the tournament by bribing the other players, while not exceeding a given budget for the bribes. Mattei et al. [Journal of Applied Logic, 2015] showed that the problem can be solved in pseudo-polynomial time, and that it is in XP when parameterized by the number of players. We show that the problem is weakly NP-hard and W[1]-hard when parameterized by the number of players. On the algorithmic side, we show that the problem is fixed-parameter tractable when parameterized either by the number of different bribe values or the number of different probability values. To this end, we establish several results that are of independent interest. In particular, we show that the product knapsack problem is W[1]-hard when parameterized by the number of items in the knapsack, and that constructive bribery for cup tournaments is W[1]-hard when parameterized by the number of players. Furthermore, we present a novel way of designing mixed integer linear programs, ensuring optimal solutions where all variables are integers.
☆ Capacity Provisioning Motivated Online Non-Convex Optimization Problem with Memory and Switching Cost
An online non-convex optimization problem is considered where the goal is to minimize the flow time (total delay) of a set of jobs by modulating the number of active servers, but with a switching cost associated with changing the number of active servers over time. Each job can be processed by at most one fixed speed server at any time. Compared to the usual online convex optimization (OCO) problem with switching cost, the objective function considered is non-convex and more importantly, at each time, it depends on all past decisions and not just the present one. Both worst-case and stochastic inputs are considered; for both cases, competitive algorithms are derived.
♻ ☆ Continuous Non-monotone DR-submodular Maximization with Down-closed Convex Constraint
We investigate the continuous non-monotone DR-submodular maximization problem subject to a down-closed convex solvable constraint. Our first contribution is to construct an example to demonstrate that (first-order) stationary points can have arbitrarily bad approximation ratios, and they are usually on the boundary of the feasible domain. These findings are in contrast with the monotone case where any stationary point yields a $1/2$-approximation (Hassani et al. (2017)). Moreover, this example offers insights on how to design improved algorithms by avoiding bad stationary points, such as the restricted continuous local search algorithm (Chekuri et al. (2014)) and the aided measured continuous greedy (Buchbinder and Feldman (2019)). However, the analyses in the last two algorithms only work for the discrete domain because both need to invoke the inequality that the multilinear extension of any submodular set function is bounded from below by its Lovasz extension. Our second contribution, therefore, is to remove this restriction and show that both algorithms can be extended to the continuous domain while retaining the same approximation ratios, and hence offering improved approximation ratios over those in Bian et al. (2017a). for the same problem. At last, we also include numerical experiments to demonstrate our algorithms on problems arising from machine learning and artificial intelligence.
♻ ☆ A Nonparametric Framework for Online Stochastic Matching with Correlated Arrivals
The design of online algorithms for matching markets and revenue management settings is usually bound by the stochastic prior that the demand process is formed by a fixed-length sequence of queries with unknown types, each drawn independently. This assumption of {\em serial independence} implies that the demand of each type, i.e., the number of queries of a given type, has low variance and is approximately Poisson-distributed. This paper explores more general stochastic models for online edge-weighted matching that depart from the serial independence assumption. We propose two new models, \Indep and \Correl, that capture different forms of serial correlations by combining a nonparametric distribution for the demand with standard assumptions on the arrival patterns -- adversarial or random order. The \Indep model has arbitrary marginal distributions for the demands but assumes cross-sectional independence for the customer types, whereas the \Correl model captures common shocks across customer types. We demonstrate that fluid relaxations, which rely solely on expected demand information, have arbitrarily bad performance guarantees. In contrast, we develop new algorithms that essentially achieve optimal constant-factor performance guarantees in each model. Our mathematical analysis includes tighter linear programming relaxations that leverage distribution knowledge, and a new lossless randomized rounding scheme in the case of $\Indep$. In numerical simulations of the $\Indep$ model, we find that tighter relaxations are beneficial under high-variance demand and that our demand-aware rounding scheme can outperform stockout-aware rounding.
♻ ☆ Fixed-sparsity matrix approximation from matrix-vector products
We study the problem of approximating a matrix $\mathbf{A}$ with a matrix that has a fixed sparsity pattern (e.g., diagonal, banded, etc.), when $\mathbf{A}$ is accessed only by matrix-vector products. We describe a simple randomized algorithm that returns an approximation with the given sparsity pattern with Frobenius-norm error at most $(1+\varepsilon)$ times the best possible error. When each row of the desired sparsity pattern has at most $s$ nonzero entries, this algorithm requires $O(s/\varepsilon)$ non-adaptive matrix-vector products with $\mathbf{A}$. We also prove a matching lower-bound, showing that, for any sparsity pattern with $\Theta(s)$ nonzeros per row and column, any algorithm achieving $(1+\epsilon)$ approximation requires $\Omega(s/\varepsilon)$ matrix-vector products in the worst case. We thus resolve the matrix-vector product query complexity of the problem up to constant factors, even for the well-studied case of diagonal approximation, for which no previous lower bounds were known.
♻ ☆ Adaptive Frequency Bin Interval in FFT via Dense Sampling Factor $α$
The Fast Fourier Transform (FFT) is a fundamental tool for signal analysis, widely used across various fields. However, traditional FFT methods encounter challenges in adjusting the frequency bin interval, which may impede accurate spectral analysis. In this study, we propose a method for adjusting the frequency bin interval in FFT by introducing a parameter $\alpha$. We elucidate the underlying principles of the proposed method and discuss its potential applications across various contexts. Our findings suggest that the proposed method offers a promising approach to overcome the limitations of traditional FFT methods and enhance spectral analysis accuracy.
♻ ☆ Nearly Optimal Fault Tolerant Distance Oracle STOC
We present an $f$-fault tolerant distance oracle for an undirected weighted graph where each edge has an integral weight from $[1 \dots W]$. Given a set $F$ of $f$ edges, as well as a source node $s$ and a destination node $t$, our oracle returns the \emph{shortest path} from $s$ to $t$ avoiding $F$ in $O((cf \log (nW))^{O(f^2)})$ time, where $c > 1$ is a constant. The space complexity of our oracle is $O(f^4n^2\log^2 (nW))$. For a constant $f$, our oracle is nearly optimal both in terms of space and time (barring some logarithmic factor).
comment: accepted in STOC, 2024
♻ ☆ Parallel Self-Avoiding Walks for a Low-Autocorrelation Binary Sequences Problem
A low-autocorrelation binary sequences problem with a high figure of merit factor represents a formidable computational challenge. An efficient parallel computing algorithm is required to reach the new best-known solutions for this problem. Therefore, we developed the $\mathit{sokol}_{\mathit{skew}}$ solver for the skew-symmetric search space. The developed solver takes the advantage of parallel computing on graphics processing units. The solver organized the search process as a sequence of parallel and contiguous self-avoiding walks and achieved a speedup factor of 387 compared with $\mathit{lssOrel}$, its predecessor. The $\mathit{sokol}_{\mathit{skew}}$ solver belongs to stochastic solvers and can not guarantee the optimality of solutions. To mitigate this problem, we established the predictive model of stopping conditions according to the small instances for which the optimal skew-symmetric solutions are known. With its help and 99% probability, the $\mathit{sokol}_{\mathit{skew}}$ solver found all the known and seven new best-known skew-symmetric sequences for odd instances from $L=121$ to $L=223$. For larger instances, the solver can not reach 99% probability within our limitations, but it still found several new best-known binary sequences. We also analyzed the trend of the best merit factor values, and it shows that as sequence size increases, the value of the merit factor also increases, and this trend is flatter for larger instances.
comment: 17 pages, 5 figures
Signal Processing
☆ Enhancing Indoor and Outdoor THz Communications with Beyond Diagonal-IRS: Optimization and Performance Analysis
This work investigates the application of Beyond Diagonal Intelligent Reflective Surface (BD-IRS) to enhance THz downlink communication systems, operating in a hybrid: reflective and transmissive mode, to simultaneously provide services to indoor and outdoor users. We propose an optimization framework that jointly optimizes the beamforming vectors and phase shifts in the hybrid reflective/transmissive mode, aiming to maximize the system sum rate. To tackle the challenges in solving the joint design problem, we employ the conjugate gradient method and propose an iterative algorithm that successively optimizes the hybrid beamforming vectors and the phase shifts. Through comprehensive numerical simulations, our findings demonstrate a significant improvement in rate when compared to existing benchmark schemes, including time- and frequency-divided approaches, by approximately $30.5\%$ and $70.28\%$ respectively. This underscores the significant influence of IRS elements on system performance relative to that of base station antennas, highlighting their pivotal role in advancing the communication system efficacy.
☆ Reciprocity Calibration of Dual-Antenna Repeaters
We present a reciprocity calibration method for dual-antenna repeaters in wireless networks. The method uses bi-directional measurements between two network nodes, A and B, where for each bi-directional measurement, the repeaters are configured in different states. The nodes A and B could be two access points in a distributed MIMO system, or they could be a base station and a mobile user terminal, for example. From the calibration measurements, the differences between the repeaters' forward and reverse gains are estimated. The repeaters are then (re-)configured to compensate for these differences such that the repeaters appear, transparently to the network, as reciprocal components of the propagation environment, enabling reciprocity-based beamforming in the network.
comment: IEEE Wireless Communications Letters, 2024
☆ Scalable Non-Cartesian Magnetic Resonance Imaging with R2D2
We propose a new approach for non-Cartesian magnetic resonance image reconstruction. While unrolled architectures provide robustness via data-consistency layers, embedding measurement operators in Deep Neural Network (DNN) can become impractical at large scale. Alternative Plug-and-Play (PnP) approaches, where the denoising DNNs are blind to the measurement setting, are not affected by this limitation and have also proven effective, but their highly iterative nature also affects scalability. To address this scalability challenge, we leverage the "Residual-to-Residual DNN series for high-Dynamic range imaging (R2D2)" approach recently introduced in astronomical imaging. R2D2's reconstruction is formed as a series of residual images, iteratively estimated as outputs of DNNs taking the previous iteration's image estimate and associated data residual as inputs. The method can be interpreted as a learned version of the Matching Pursuit algorithm. We demonstrate R2D2 in simulation, considering radial k-space sampling acquisition sequences. Our preliminary results suggest that R2D2 achieves: (i) suboptimal performance compared to its unrolled incarnation R2D2-Net, which is however non-scalable due to the necessary embedding of NUFFT-based data-consistency layers; (ii) superior reconstruction quality to a scalable version of R2D2-Net embedding an FFT-based approximation for data consistency; (iii) superior reconstruction quality to PnP, while only requiring few iterations.
comment: submitted to IEEE EUSIPCO 2024
☆ Synthesizing Soundscapes: Leveraging Text-to-Audio Models for Environmental Sound Classification
In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio gener- ation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper aims to investigate these aspects, specifically focusing on the task of classification of environmental sounds. This study analyzes the performance of two different environmental classification systems when data generated from text-to-audio models is used for training. Two cases are considered: a) when the training dataset is augmented by data coming from two different text-to-audio models; and b) when the training dataset consists solely of synthetic audio generated. In both cases, the performance of the classification task is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas the performance of the models drops when relying on only generated audio.
comment: Submitted to EUSIPCO 2024
☆ Environment Reconstruction based on Multi-User Selection and Multi-Modal Fusion in ISAC
Integrated sensing and communications (ISAC) has been deemed as a key technology for the sixth generation (6G) wireless communications systems. In this paper, we explore the inherent clustered nature of wireless users and design a multi-user based environment reconstruction scheme. Specifically, we first select users based on the estimation precision of channel's multipath, including the line-of-sight (LOS) and the non-line-of-sight (NLOS) paths, to enhance the accuracy of environment reconstruction. Then, we develop a fusion strategy that merges communications signalling with camera image to increase the accuracy and robustness of environment reconstruction. The simulation results demonstrate that the proposed algorithm can achieve a remarkable sensing accuracy of centimeter level, which is about 17 times better than the scheme without user selection. Meanwhile, the fusion of communications data and vision data leads to a threefold accuracy improvement over the image only method, especially under challenging weather conditions like raining and snowing.
☆ Robust Analysis of Full-Duplex Two-Way Space Shift Keying With RIS Systems
Reconfigurable intelligent surface (RIS)-assisted index modulation system schemes are considered a promising technology for sixth-generation (6G) wireless communication systems, which can enhance various system capabilities such as coverage and reliability. However, obtaining perfect channel state information (CSI) is challenging due to the lack of a radio frequency chain in RIS. In this paper, we investigate the RIS-assisted full-duplex (FD) two-way space shift keying (SSK) system under imperfect CSI, where the signal emissions are augmented by deploying RISs in the vicinity of two FD users. The maximum likelihood detector is utilized to recover the transmit antenna index. With this in mind, we derive closed-form average bit error probability (ABEP) expression based on the Gaussian-Chebyshev quadrature (GCQ) method and provide the upper bound and asymptotic ABEP expressions in the presence of channel estimation errors. To gain more insights, we also derive the outage probability and provide the throughput of the proposed scheme with imperfect CSI. The correctness of the analytical derivation results is confirmed via Monte Carlo simulations. It is demonstrated that increasing the number of elements of RIS can significantly improve the ABEP performance of the FD system over the half-duplex (HD) system. Furthermore, in the high SNR region, the ABEP performance of the FD system is better than that of the HD system.
☆ Improving the Spatial Correlation Characteristics of Antenna Arrays using Linear Operators and Wide-band Modelling
The analysis of wireless communication channels at the mmWave, sub-THz and THz bands gives rise to difficulties in the construction of antenna arrays due to the small maximum inter-element spacing constraints at these frequencies. Arrays with uniform spacing greater than half the wavelength for a certain carrier frequency exhibit aliasing side-lobes in the angular domain, prohibiting non-ambiguous estimates of a propagating wave-front's angle of arrival. In this paper, we present how wide-band modelling of the array response is useful in mitigating this spatial aliasing effect. This approach aims to reduce the grating lobes by exploiting the angle- and frequency-dependent phase-shifts observed in the response of the array to a planar wave-front travelling across it. Furthermore, we propose a method by which the spatial correlation characteristics of an array operating at 33 GHz carrier frequency with an instantaneous bandwidth of 1 GHz can be improved such that the angular-domain side-lobes are reduced by 5-10 dB. This method, applicable to arbitrary antenna array manifolds, makes use of a linear operator that is applied to the base-band samples of the channel transfer function measured in space and frequency domains. By means of synthetically simulated arrays, we show that when operating with a bandwidth of 1 GHz, the use of a derived linear operator applied to the array output results in the spatial correlation characteristics approaching those of the array operating at a bandwidth of 12 GHz. Hence, non-ambiguous angle estimates can be obtained in the field without the use of expensive high-bandwidth RF front-end components.
comment: 7 pages, 4 figures, 27th Workshop on Smart Antennas, 2024
☆ Leveraging A Variety of Anchors in Cellular Network for Ubiquitous Sensing
Integrated sensing and communication (ISAC) has recently attracted tremendous attention from both academia and industry, being envisioned as a key part of the standards for the sixth-generation (6G) cellular network. A key challenge of 6G-oriented ISAC lies in how to perform ubiquitous sensing based on the communication signals and devices. Previous works have made great progresses on studying the signal waveform design that leads to optimal communication-sensing performance tradeoff. In this article, we aim to focus on issues arising from the exploitation of the communication devices for sensing in 6G network. Particularly, we will discuss about how to leverage various nodes available in the cellular network as anchors to perform ubiquitous sensing. On one hand, the base stations (BSs) will be the most important anchors in the future 6G ISAC network, since they can generate/process radio signals with high range/angle resolutions, and their positions are precisely known. Correspondingly, we will first study the BS-based sensing technique. On the other hand, the BSs alone may not enable ubiquitous sensing, since they cannot cover all the places with strong line-of-sight (LOS) links. This motivates us to investigate the possibility of using other nodes that are with higher density in the network to act as the anchors. Along this line, we are interested in two types of new anchors - user equipments (UEs) and reconfigurable intelligent surfaces (RISs). This paper will shed light on the opportunities and challenges brought by UE-assisted sensing and RIS-assisted sensing. Our goal is to devise a novel 6G-oriented sensing architecture where BSs, UEs, and RISs can work together to provide ubiquitous sensing services.
comment: to appear in IEEE Communications Magazine
☆ Waveform Design for Joint Communication and SAR Imaging Under Random Signaling
Conventional synthetic aperture radar (SAR) imaging systems typically employ deterministic signal designs, which lack the capability to convey communication information and are thus not suitable for integrated sensing and communication (ISAC) scenarios. In this letter, we propose a joint communication and SAR imaging (JCASAR) system based on orthogonal frequency-division multiplexing (OFDM) signal with cyclic prefix (CP), which is capable of reconstructing the target profile while serving a communication user. In contrast to traditional matched filters, we propose a least squares (LS) estimator for range profiling. Then the SAR image is obtained followed by range cell migration correction (RCMC) and azimuth processing. By minimizing the mean squared error (MSE) of the proposed LS estimator, we investigate the optimal waveform design for SAR imaging, and JCASAR under random signaling, where power allocation strategies are conceived for Gaussian-distributed ISAC signals, in an effort to strike a flexible performance tradeoff between the communication and SAR imaging tasks. Numerical results are provided to validate the effectiveness of the proposed ISAC waveform design for JCASAR systems.
comment: 5 pages
☆ Channel-Adaptive Pilot Design for FDD-MIMO Systems Utilizing Gaussian Mixture Models
In this work, we propose to utilize Gaussian mixture models (GMMs) to design pilots for downlink (DL) channel estimation in frequency division duplex (FDD) systems. The GMM captures prior information during training that is leveraged to design a codebook of pilot matrices in an initial offline phase. Once shared with the mobile terminal (MT), the GMM is utilized to determine a feedback index at the MT in the online phase. This index selects a pilot matrix from a codebook, eliminating the need for online pilot optimization. The GMM is further used for DL channel estimation at the MT via observation-dependent linear minimum mean square error (LMMSE) filters, parametrized by the GMM. The analytic representation of the GMM allows adaptation to any signal-to-noise ratio (SNR) level and pilot configuration without re-training. With extensive simulations, we demonstrate the superior performance of the proposed GMM-based pilot scheme compared to state-of-the-art approaches.
☆ FaultGuard: A Generative Approach to Resilient Fault Prediction in Smart Electrical Grids
Predicting and classifying faults in electricity networks is crucial for uninterrupted provision and keeping maintenance costs at a minimum. Thanks to the advancements in the field provided by the smart grid, several data-driven approaches have been proposed in the literature to tackle fault prediction tasks. Implementing these systems brought several improvements, such as optimal energy consumption and quick restoration. Thus, they have become an essential component of the smart grid. However, the robustness and security of these systems against adversarial attacks have not yet been extensively investigated. These attacks can impair the whole grid and cause additional damage to the infrastructure, deceiving fault detection systems and disrupting restoration. In this paper, we present FaultGuard, the first framework for fault type and zone classification resilient to adversarial attacks. To ensure the security of our system, we employ an Anomaly Detection System (ADS) leveraging a novel Generative Adversarial Network training layer to identify attacks. Furthermore, we propose a low-complexity fault prediction model and an online adversarial training technique to enhance robustness. We comprehensively evaluate the framework's performance against various adversarial attacks using the IEEE13-AdvAttack dataset, which constitutes the state-of-the-art for resilient fault prediction benchmarking. Our model outclasses the state-of-the-art even without considering adversaries, with an accuracy of up to 0.958. Furthermore, our ADS shows attack detection capabilities with an accuracy of up to 1.000. Finally, we demonstrate how our novel training layers drastically increase performances across the whole framework, with a mean increase of 154% in ADS accuracy and 118% in model accuracy.
☆ A TDD Distributed MIMO Testbed Using a 1-Bit Radio-Over-Fiber Fronthaul Architecture
We present the uplink and downlink of a time-division duplex distributed multiple-input multiple-output (D-MIMO) testbed, based on a 1-bit radio-over-fiber architecture, which is low-cost and scalable. The proposed architecture involves a central unit (CU) that is equipped with 1-bit digital-to-analog and analog-to-digital converters, operating at 10 GS/s. The CU is connected to multiple single-antenna remote radio heads (RRHs) via optical fibers, over which a binary RF waveform is transmitted. In the uplink, a binary RF waveform is generated at the RRHs by a comparator, whose inputs are the received RF signal and a suitably designed dither signal. In the downlink, a binary RF waveform is generated at the CU via bandpass sigma-delta modulation. Our measurement results show that low error-vector magnitude (EVM) can be achieved in both the uplink and the downlink, despite 1-bit sampling at the CU. Specifically, for point-to-point over-cable transmission between a single user equipment (UE) and a CU equipped with a single RRH, we report, for a 10 MBd signal using single-carrier 16QAM modulation, an EVM of 3.3% in the downlink, and of 4.5% in the uplink. We then consider a CU connected to 3 RRHs serving over the air 2 UEs, and show that, after over-the-air reciprocity calibration, a downlink zero-forcing precoder designed on the basis of uplink channel estimates at the CU, achieves an EVM of 6.4% and 10.9% at UE 1 and UE 2, respectively. Finally, we investigate the ability of the proposed architecture to support orthogonal frequency-division multiplexing (OFDM) waveforms, and its robustness against both in-band and out-of-band interference.
comment: Accepted for publication in IEEE Transactions on Microwave Theory and Techniques
☆ Destination-Constrained Linear Dynamical System Modeling in Set-Valued Frameworks
Directional motion towards a specified destination is a common occurrence in physical processes and human societal activities. Utilizing this prior information can significantly improve the control and predictive performance of system models. This paper primarily focuses on reconstructing linear dynamic system models based on destination constraints in the set-valued framework. We treat destination constraints as inherent information in the state evolution process and employ convex optimization techniques to construct a coherent and robust state model. This refined model effectively captures the impact of destination constraints on the state evolution at each time step. Furthermore, we design an optimal weight matrix for the reconstructed model to ensure smoother and more natural trajectories of state evolution. We also analyze the theoretical guarantee of optimality for this weight matrix and the properties of the reconstructed model. Finally, simulation experiments verify that the reconstructed model has significant advantages over the unconstrained and unoptimized weighted models and constrains the evolution of state trajectories with different starting and ending points.
comment: 15 pages, 11 figures
☆ Unsupervised Learning for Joint Beamforming Design in RIS-aided ISAC Systems
It is critical to design efficient beamforming in reconfigurable intelligent surface (RIS)-aided integrated sensing and communication (ISAC) systems for enhancing spectrum utilization. However, conventional methods often have limitations, either incurring high computational complexity due to iterative algorithms or sacrificing performance when using heuristic methods. To achieve both low complexity and high spectrum efficiency, an unsupervised learning-based beamforming design is proposed in this work. We tailor image-shaped channel samples and develop an ISAC beamforming neural network (IBF-Net) model for beamforming. By leveraging unsupervised learning, the loss function incorporates key performance metrics like sensing and communication channel correlation and sensing channel gain, eliminating the need of labeling. Simulations show that the proposed method achieves competitive performance compared to benchmarks while significantly reduces computational complexity.
comment: 5 pages, 4 figures, references added
☆ On the Impact of Random Node Sampling on Adaptive Diffusion Networks
In this paper, we analyze the effects of random sampling on adaptive diffusion networks. These networks consist in a collection of nodes that can measure and process data, and that can communicate with each other to pursue a common goal of estimating an unknown system. In particular, we consider in our theoretical analysis the diffusion least-mean-squares algorithm in a scenario in which the nodes are randomly sampled. Hence, each node may or may not adapt its local estimate at a certain iteration. Our model shows that, if the nodes cooperate, a reduction in the sampling probability leads to a slight decrease in the steady-state Network Mean-Square Deviation (NMSD), assuming that the environment is stationary and that all other parameters of the algorithm are kept fixed. Furthermore, under certain circumstances, this can also ensure the stability of the algorithm in situations in which it would otherwise be unstable. Although counter-intuitive, our findings are backed by simulation results, which match the theoretical curves well.
comment: Submitted to the IEEE Transactions on Signal Processing
♻ ☆ Room Transfer Function Reconstruction Using Complex-valued Neural Networks and Irregularly Distributed Microphones
Reconstructing the room transfer functions needed to calculate the complex sound field in a room has several impor- tant real-world applications. However, an unpractical number of microphones is often required. Recently, in addition to classical signal processing methods, deep learning techniques have been applied to reconstruct the room transfer function starting from a very limited set of measurements at scattered points in the room. In this paper, we employ complex-valued neural networks to estimate room transfer functions in the frequency range of the first room resonances, using a few irregularly distributed microphones. To the best of our knowledge, this is the first time that complex-valued neural networks are used to estimate room transfer functions. To analyze the benefits of applying complex- valued optimization to the considered task, we compare the proposed technique with a state-of-the-art kernel-based signal processing approach for sound field reconstruction, showing that the proposed technique exhibits relevant advantages in terms of phase accuracy and overall quality of the reconstructed sound field. For informative purposes, we also compare the model with a similarly-structured data-driven approach that, however, applies a real-valued neural network to reconstruct only the magnitude of the sound field.
comment: Submitted to EUSIPCO 2024
♻ ☆ On the Accuracy of Phase Extraction from a Known-Frequency Noisy Sinusoidal Signal SP
Accurate phase extraction from sinusoidal signals is a crucial task in various signal processing applications. While prior research predominantly addresses the case of asynchronous sampling with unknown signal frequency, this study focuses on the more specific situation where synchronous sampling is possible, and the signal's frequency is known. In this framework, a comprehensive analysis of phase estimation accuracy in the presence of both additive and phase noises is presented. A closed-form expression for the asymptotic Probability Density Function (PDF) of the resulting phase estimator is presented, and validated by simulations that depict Root Mean Square Error (RMSE) trends in different noise scenarios. The latter estimator is asymptotically efficient, exhibiting fast convergence towards its Cram\'er-Rao Lower Bound (CRLB). Three distinct RMSE behaviours were identified depending on the Signal to Noise Ratio (SNR), sample count (N), and noise level: (i) saturation towards a random guess at low SNR values, (ii) linear decreasing relationship with the square roots of N and SNR at moderate noise levels, and (iii) saturation at high SNR towards a noise floor function of the phase noise level. By quantifying the impact of sample count, additive noise, and phase noise on phase estimation accuracy, this work provides valuable insights for designing systems that require precise phase extraction, such as phase-based fluorescence assays or system identification.
comment: 12 pages, 11 figures, initially submitted to Analog Integrated Circuits and Signal Processing the 19th December 2023, Updated 2024/02/07: added watermark, Updated 2024/03/26 : due to a lack of response from AICSP despite several follow-up, we withdrew it, and submitted it to IEEE Transactions on Signal Processing instead
♻ ☆ Discretized Distributed Optimization over Dynamic Digraphs
We consider a discrete-time model of continuous-time distributed optimization over dynamic directed-graphs (digraphs) with applications to distributed learning. Our optimization algorithm works over general strongly connected dynamic networks under switching topologies, e.g., in mobile multi-agent systems and volatile networks due to link failures. Compared to many existing lines of work, there is no need for bi-stochastic weight designs on the links. The existing literature mostly needs the link weights to be stochastic using specific weight-design algorithms needed both at the initialization and at all times when the topology of the network changes. This paper eliminates the need for such algorithms and paves the way for distributed optimization over time-varying digraphs. We derive the bound on the gradient-tracking step-size and discrete time-step for convergence and prove dynamic stability using arguments from consensus algorithms, matrix perturbation theory, and Lyapunov theory. This work, particularly, is an improvement over existing stochastic-weight undirected networks in case of link removal or packet drops. This is because the existing literature may need to rerun time-consuming and computationally complex algorithms for stochastic design, while the proposed strategy works as long as the underlying network is weight-symmetric and balanced. The proposed optimization framework finds applications to distributed classification and learning.
♻ ☆ When Robotics Meets Wireless Communications: An Introductory Tutorial
The importance of ground Mobile Robots (MRs) and Unmanned Aerial Vehicles (UAVs) within the research community, industry, and society is growing fast. Many of these agents are nowadays equipped with communication systems that are, in some cases, essential to successfully achieve certain tasks. In this context, we have begun to witness the development of a new interdisciplinary research field at the intersection of robotics and communications. This research field has been boosted by the intention of integrating UAVs within the 5G and 6G communication networks. This research will undoubtedly lead to many important applications in the near future. Nevertheless, one of the main obstacles to the development of this research area is that most researchers address these problems by oversimplifying either the robotics or the communications aspect. This impedes the ability of reaching the full potential of this new interdisciplinary research area. In this tutorial, we present some of the modelling tools necessary to address problems involving both robotics and communication from an interdisciplinary perspective. As an illustrative example of such problems, we focus in this tutorial on the issue of communication-aware trajectory planning.
comment: 35 pages, 192 references
♻ ☆ Adaptive Frequency Bin Interval in FFT via Dense Sampling Factor $α$
The Fast Fourier Transform (FFT) is a fundamental tool for signal analysis, widely used across various fields. However, traditional FFT methods encounter challenges in adjusting the frequency bin interval, which may impede accurate spectral analysis. In this study, we propose a method for adjusting the frequency bin interval in FFT by introducing a parameter $\alpha$. We elucidate the underlying principles of the proposed method and discuss its potential applications across various contexts. Our findings suggest that the proposed method offers a promising approach to overcome the limitations of traditional FFT methods and enhance spectral analysis accuracy.
♻ ☆ Redundancy Transmission in UAV-Aided LoRa Networks Featuring Wake-Up Radios
We consider a LoRa sensor network featuring a UAV-mounted gateway for collecting sensor data (messages). Wake-up radios (WuR) are employed to inform the sensors of the UAV's arrival. Building on an existing random access scheme for such setups, we propose and evaluate two redundancy transmission protocols for enhancing the reliability of the data transfer. One protocol employs fountain-coded transmissions, whereas the other performs message replication. Our results illustrate how redundancy transmission can be beneficial under the time constraints imposed by the UAV's limited hovering duration, and how node density, hovering time, and sensor energy budget impact the performance of the schemes.
♻ ☆ OTFS-based Robust MMSE Precoding Design in Over-the-air Computation
Over-the-air computation (AirComp), as a data aggregation method that can improve network efficiency by exploiting the superposition characteristics of wireless channels, has received much attention recently. Meanwhile, the orthogonal time frequency space (OTFS) modulation can provide a strong Doppler resilience and facilitate reliable transmission for high-mobility communications. Hence, in this work, we investigate an OTFS-based AirComp system in the presence of time-frequency dual-selective channels. In particular, we commence from the development of a novel transmission framework for the considered system, where the pilot signal is sent together with data, and the channel estimation is implemented according to the echo from the access point to the sensor, thereby reducing the overhead of channel state information (CSI) feedback. Hereafter, based on the CSI estimated from the previous frame, a robust precoding matrix aiming at minimizing mean square error in the current frame is designed, which takes into account the estimation error from the receiver noise and the outdated CSI. The simulation results demonstrate the effectiveness of the proposed robust precoding scheme by comparing it with the non-robust precoding. The performance gain is more obvious in a high signal-to-noise ratio in case of large channel estimation errors.
♻ ☆ Digital Twin Channel for 6G: Concepts, Architectures and Potential Applications
Digital twin channel (DTC) is the real-time mapping of a wireless channel from the physical world to the digital world, which is expected to provide significant performance enhancements for the sixth-generation (6G) air-interface design. In this work, we first define five evolution levels of channel twins with the progression of wireless communication. The fifth level, autonomous DTC, is elaborated with multi-dimensional factors such as methodology, characterization precision, and data category. Then, we provide detailed insights into the requirements and architecture of a complete DTC for 6G. Subsequently, a sensing-enhanced real-time channel prediction platform and experimental validations are exhibited. Finally, drawing from the vision of the 6G network, we explore the potential applications and the open issues in future DTC research.
comment: 7 pages, 5 figures, 15 references. It will be submitted to IEEE journal
♻ ☆ Estimation of Complex Valued Laplacian Matrices for Topology Identification in Power Systems
In this paper, we investigate the problem of estimating a complex-valued Laplacian matrix with a focus on its application in the estimation of admittance matrices in power systems. The proposed approach is based on a constrained maximum likelihood estimator (CMLE) of the complex-valued Laplacian, which is formulated as an optimization problem with Laplacian and sparsity constraints. The complex-valued Laplacian is a symmetric, non-Hermitian matrix that exhibits a joint sparsity pattern between its real and imaginary parts. Thus, we present a group-sparse-based penalized log-likelihood approach for the Laplacian estimation. Leveraging the mixed \ell 2,1 norm relaxation of the joint sparsity constraint, we develop a new alternating direction method of multipliers (ADMM) estimation algorithm for the implementation of the CMLE of the Laplacian matrix under a linear Gaussian model. Next, we apply the proposed ADMM algorithms for the problem of estimating the admittance matrix under three commonly used measurement models that stem from Kirchhoff and Ohm laws, each with different assumptions and simplifications: 1) the nonlinear alternating current (AC) model; 2) the decoupled linear power flow (DLPF) model; and 3) the direct current (DC) model. The performance of the ADMM algorithm is evaluated using data from the IEEE 33-bus power system data under different settings. The numerical experiments demonstrate that the proposed algorithm outperforms existing methods in terms of mean-squared-error (MSE) and F-score, thus providing a more accurate recovery of the admittance matrix.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ RIS-assisted Cell-Free Massive MIMO Systems With Two-Timescale Design and Hardware Impairments
Integrating the reconfigurable intelligent surface (RIS) into a cell-free massive multiple-input multiple-output (CF-mMIMO) system is an effective solution to achieve high system capacity with low cost and power consumption. However, existing works of RIS-assisted systems mostly assumed perfect hardware, while the impact of hardware impairments (HWIs) is generally ignored. In this paper, we consider the general Rician fading channel and uplink transmission of the RIS-assisted CF-mMIMO system under transceiver impairments and RIS phase noise. To reduce the feedback overhead and power consumption, we propose a two-timescale transmission scheme to optimize the passive beamformers at RISs with statistical channel state information (CSI), while transmit beamformers at access points (APs) are designed based on instantaneous CSI. Also, the maximum ratio combining (MRC) detection is applied to the central processing unit (CPU). On this basis, we derive the closed-form approximate expression of the achievable rate, based on which the impact of HWIs and the power scaling laws are analyzed to draw useful theoretical insights. To maximize the users' sum rate or minimum rate, we first transform our rate expression into a tractable form, and then optimize the phase shifts of RISs based on an accelerated gradient ascent method. Finally, numerical results are presented to demonstrate the correctness of our derived expressions and validate the previous analysis, which provide some guidelines for the practical application of the imperfect RISs in the CF-mMIMO with transceiver HWIs.
comment: 51 pages, 11 figures
♻ ☆ Metasurface-Enabled Multifunctional Single-Frequency Sensors without External Power
IoT sensors are crucial for visualizing multidimensional and multimodal information and enabling future IT applications/services such as cyber-physical space, digital twins, autonomous driving, smart cities, and virtual/augmented reality (VR or AR). However, IoT sensors need to be battery-free to realistically manage and maintain the growing number of available sensing devices. Here, we provide a novel sensor design approach that employs metasurfaces to enable multifunctional sensing without requiring an external power source. Importantly, unlike existing metasurface-based sensors, our metasurfaces can sense multiple physical parameters even at a fixed frequency by breaking classic harmonic oscillations in the time domain, making the proposed sensors viable for usage with limited frequency resources. Moreover, we provide a method for predicting physical parameters using the machine learning-based approach of random forest regression. The sensing performance was confirmed by estimating temperature and light intensity, and excellent determination coefficients larger than 0.96 were achieved. Our study affords new opportunities for sensing multiple physical properties without relying on an external power source or needing multiple frequencies, which markedly simplifies and facilitates the design of next-generation wireless communication systems.
comment: 47 pages, 23 figures
♻ ☆ Identification of Craving Maps among Marijuana Users via the Analysis of Functional Brain Networks with High-Order Attention Graph Neural Networks
The excessive consumption of marijuana can induce substantial psychological and social consequences. In this investigation, we propose an elucidative framework termed high-order graph attention neural networks (HOGANN) for the classification of Marijuana addiction, coupled with an analysis of localized brain network communities exhibiting abnormal activities among chronic marijuana users. HOGANN integrates dynamic intrinsic functional brain networks, estimated from resting-state functional magnetic resonance imaging (rs-fMRI), using long short-term memory (LSTM) to capture temporal network dynamics. We employ a high-order attention module for information fusion and message passing among neighboring nodes, enhancing the network community analysis. Our model is validated across two distinct data cohorts, yielding substantially higher classification accuracy than benchmark algorithms. Furthermore, we discern the most pertinent subnetworks and cognitive regions affected by persistent marijuana consumption, indicating adverse effects on functional brain networks, particularly within the dorsal attention and frontoparietal networks. Intriguingly, our model demonstrates superior performance in cohorts exhibiting prolonged dependence, implying that prolonged marijuana usage induces more pronounced alterations in brain networks. The model proficiently identifies craving brain maps, thereby delineating critical brain regions for analysis.
♻ ☆ Integrated Sensing and Communications Towards Proactive Beamforming in mmWave V2I via Multi-Modal Feature Fusion (MMFF)
The future of vehicular communication networks relies on mmWave massive multi-input-multi-output antenna arrays for intensive data transfer and massive vehicle access. However, reliable vehicle-to-infrastructure links require exact alignment between the narrow beams, which traditionally involves excessive signaling overhead. To address this issue, we propose a novel proactive beamforming scheme that integrates multi-modal sensing and communications via Multi-Modal Feature Fusion Network (MMFF-Net), which is composed of multiple neural network components with distinct functions. Unlike existing methods that rely solely on communication processing, our approach obtains comprehensive environmental features to improve beam alignment accuracy. We verify our scheme on the Vision-Wireless (ViWi) dataset, which we enriched with realistic vehicle drifting behavior. Our proposed MMFF-Net achieves more accurate and stable angle prediction, which in turn increases the achievable rates and reduces the communication system outage probability. Even in complex dynamic scenarios with adverse environment conditions, robust prediction results can be guaranteed, demonstrating the feasibility and practicality of the proposed proactive beamforming approach.
comment: 14 pages, 12 figures, 5 tables
♻ ☆ Time-Frequency-Space Transmit Design and Receiver Processing with Dynamic Subarray for Terahertz Integrated Sensing and Communication
Terahertz (THz) integrated sensing and communication (ISAC) enables simultaneous data transmission with Terabit-per-second (Tbps) rate and millimeter-level accurate sensing. To realize such a blueprint, ultra-massive antenna arrays with directional beamforming are used to compensate for severe path loss in the THz band. In this paper, the time-frequency-space transmit design is investigated for THz ISAC to generate time-varying scanning sensing beams and stable communication beams. Specifically, with the dynamic array-of-subarray (DAoSA) hybrid beamforming architecture and multi-carrier modulation, two ISAC hybrid precoding algorithms are proposed, namely, a vectorization (VEC) based algorithm that outperforms existing ISAC hybrid precoding methods and a low-complexity sensing codebook assisted (SCA) approach. Meanwhile, coupled with the transmit design, parameter estimation algorithms are proposed to realize high-accuracy sensing, including a wideband DAoSA MUSIC method for angle estimation and a sum-DFT-GSS approach for range and velocity estimation. Numerical results indicate that the proposed algorithms can realize centi-degree-level angle estimation accuracy and millimeter-level range estimation accuracy, which are one or two orders of magnitudes better than the methods in the millimeter-wave band. In addition, to overcome the cyclic prefix limitation and Doppler effects, an inter-symbol interference- and inter-carrier interference-tackled sensing algorithm is developed to refine sensing capabilities for THz ISAC.
♻ ☆ Near-Field Multiuser Beam-Training for Extremely Large-Scale MIMO Systems
Extremely large-scale multiple-input multiple-output (XL-MIMO) systems are capable of improving spectral efficiency by employing far more antennas than conventional massive MIMO at the base station (BS). However, beam training in multiuser XL-MIMO systems is challenging. To tackle these issues, we conceive a three-phase graph neural network (GNN)-based beam training scheme for multiuser XL-MIMO systems. In the first phase, only far-field wide beams have to be tested for each user and the GNN is utilized to map the beamforming gain information of the far-field wide beams to the optimal near-field beam for each user. In addition, the proposed GNN-based scheme can exploit the position-correlation between adjacent users for further improvement of the accuracy of beam training. In the second phase, a beam allocation scheme based on the probability vectors produced at the outputs of GNNs is proposed to address the above beam-direction conflicts between users. In the third phase, the hybrid TBF is designed for further reducing the inter-user interference. Our simulation results show that the proposed scheme improves the beam training performance of the benchmarks. Moreover, the performance of the proposed beam training scheme approaches that of an exhaustive search, despite requiring only about 7% of the pilot overhead.
comment: submitted to IEEE
Sound
☆ Synthesizing Soundscapes: Leveraging Text-to-Audio Models for Environmental Sound Classification
In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio gener- ation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper aims to investigate these aspects, specifically focusing on the task of classification of environmental sounds. This study analyzes the performance of two different environmental classification systems when data generated from text-to-audio models is used for training. Two cases are considered: a) when the training dataset is augmented by data coming from two different text-to-audio models; and b) when the training dataset consists solely of synthetic audio generated. In both cases, the performance of the classification task is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas the performance of the models drops when relying on only generated audio.
comment: Submitted to EUSIPCO 2024
☆ Deep functional multiple index models with an application to SER
Speech Emotion Recognition (SER) plays a crucial role in advancing human-computer interaction and speech processing capabilities. We introduce a novel deep-learning architecture designed specifically for the functional data model known as the multiple-index functional model. Our key innovation lies in integrating adaptive basis layers and an automated data transformation search within the deep learning framework. Simulations for this new model show good performances. This allows us to extract features tailored for chunk-level SER, based on Mel Frequency Cepstral Coefficients (MFCCs). We demonstrate the effectiveness of our approach on the benchmark IEMOCAP database, achieving good performance compared to existing methods.
comment: 5 pages, 1 figure
☆ Detection of Deepfake Environmental Audio
With the ever-rising quality of deep generative models, it is increasingly important to be able to discern whether the audio data at hand have been recorded or synthesized. Although the detection of fake speech signals has been studied extensively, this is not the case for the detection of fake environmental audio. We propose a simple and efficient pipeline for detecting fake environmental sounds based on the CLAP audio embedding. We evaluate this detector using audio data from the 2023 DCASE challenge task on Foley sound synthesis. Our experiments show that fake sounds generated by 44 state-of-the-art synthesizers can be detected on average with 98% accuracy. We show that using an audio embedding learned on environmental audio is beneficial over a standard VGGish one as it provides a 10% increase in detection performance. Informal listening to Incorrect Negative examples demonstrates audible features of fake sounds missed by the detector such as distortion and implausible background noise.
☆ Speaker Distance Estimation in Enclosures from Single-Channel Audio
Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling. Most studies predominantly center on employing a classification approach, where distances are discretized into distinct categories, enabling smoother model training and achieving higher accuracy but imposing restrictions on the precision of the obtained sound source position. Towards this direction, in this paper we propose a novel approach for continuous distance estimation from audio signals using a convolutional recurrent neural network with an attention module. The attention mechanism enables the model to focus on relevant temporal and spectral features, enhancing its ability to capture fine-grained distance-related information. To evaluate the effectiveness of our proposed method, we conduct extensive experiments using audio recordings in controlled environments with three levels of realism (synthetic room impulse response, measured response with convolved speech, and real recordings) on four datasets (our synthetic dataset, QMULTIMIT, VoiceHome-2, and STARSS23). Experimental results show that the model achieves an absolute error of 0.11 meters in a noiseless synthetic scenario. Moreover, the results showed an absolute error of about 1.30 meters in the hybrid scenario. The algorithm's performance in the real scenario, where unpredictable environmental factors and noise are prevalent, yields an absolute error of approximately 0.50 meters. For reproducible research purposes we make model, code, and synthetic datasets available at https://github.com/michaelneri/audio-distance-estimation.
comment: Accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing
☆ Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant
This paper explores whether considering alternative domain-specific embeddings to calculate the Fr\'echet Audio Distance (FAD) metric can help the FAD to correlate better with perceptual ratings of environmental sounds. We used embeddings from VGGish, PANNs, MS-CLAP, L-CLAP, and MERT, which are tailored for either music or environmental sound evaluation. The FAD scores were calculated for sounds from the DCASE 2023 Task 7 dataset. Using perceptual data from the same task, we find that PANNs-WGM-LogMel produces the best correlation between FAD scores and perceptual ratings of both audio quality and perceived fit with a Spearman correlation higher than 0.5. We also find that music-specific embeddings resulted in significantly lower results. Interestingly, VGGish, the embedding used for the original Fr\'echet calculation, yielded a correlation below 0.1. These results underscore the critical importance of the choice of embedding for the FAD metric design.
☆ Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge CVPR 2024
The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL
comment: Accepted at CVPR 2024
☆ Infrastructure-less Localization from Indoor Environmental Sounds Based on Spectral Decomposition and Spatial Likelihood Model
Human and/or asset tracking using an attached sensor units helps understand their activities. Most common indoor localization methods for human tracking technologies require expensive infrastructures, deployment and maintenance. To overcome this problem, environmental sounds have been used for infrastructure-free localization. While they achieve room-level classification, they suffer from two problems: low signal-to-noise-ratio (SNR) condition and non-uniqueness of sound over the coverage area. A microphone localization method was proposed using supervised spectral decomposition and spatial likelihood to solve these problems. The proposed method was evaluated with actual recordings in an experimental room with a size of 12 x 30 m. The results showed that the proposed method with supervised NMF was robust under low-SNR condition compared to a simple feature (mel frequency cepstrum coefficient: MFCC). Additionally, the proposed method could be easily integrated with prior distribution, which is available from other Bayesian localizations. The proposed method can be used to evaluate the spatial likelihood from environmental sounds.
comment: 6 pages, 6 figures, accepted to IEEE/SICE SII 2023
☆ Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks
This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms the iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks.
comment: Accepted by IEEE Transactions on Audio, Speech and Language Processing. arXiv admin note: substantial text overlap with arXiv:2211.15974
☆ Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer
In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time (x-axis) in spectrogram and transferring positional information between ViT through knowledge transfer. The proposed method has the following originality i) We use vertically segmented patches of log-Mel spectrogram to analyze the correlation of frequencies over time. This type of patch allows us to correlate the most relevant frequencies for a particular emotion with the time they were uttered. ii) We propose the use of image coordinate encoding, an absolute positional encoding suitable for ViT. By normalizing the x, y coordinates of the image to -1 to 1 and concatenating them to the image, we can effectively provide valid absolute positional information for ViT. iii) Through feature map matching, the locality and location information of the teacher network is effectively transmitted to the student network. Teacher network is a ViT that contains locality of convolutional stem and absolute position information through image coordinate encoding, and student network is a structure that lacks positional encoding in the basic ViT structure. In feature map matching stage, we train through the mean absolute error (L1 loss) to minimize the difference between the feature maps of the two networks. To validate the proposed method, three emotion datasets (SAVEE, EmoDB, and CREMA-D) consisting of speech were converted into log-Mel spectrograms for comparison experiments. The experimental results show that the proposed method significantly outperforms the state-of-the-art methods in terms of weighted accuracy while requiring significantly fewer floating point operations (FLOPs). Overall, the proposed method offers an promising solution for SER by providing improved efficiency and performance.
♻ ☆ Room Transfer Function Reconstruction Using Complex-valued Neural Networks and Irregularly Distributed Microphones
Reconstructing the room transfer functions needed to calculate the complex sound field in a room has several impor- tant real-world applications. However, an unpractical number of microphones is often required. Recently, in addition to classical signal processing methods, deep learning techniques have been applied to reconstruct the room transfer function starting from a very limited set of measurements at scattered points in the room. In this paper, we employ complex-valued neural networks to estimate room transfer functions in the frequency range of the first room resonances, using a few irregularly distributed microphones. To the best of our knowledge, this is the first time that complex-valued neural networks are used to estimate room transfer functions. To analyze the benefits of applying complex- valued optimization to the considered task, we compare the proposed technique with a state-of-the-art kernel-based signal processing approach for sound field reconstruction, showing that the proposed technique exhibits relevant advantages in terms of phase accuracy and overall quality of the reconstructed sound field. For informative purposes, we also compare the model with a similarly-structured data-driven approach that, however, applies a real-valued neural network to reconstruct only the magnitude of the sound field.
comment: Submitted to EUSIPCO 2024
♻ ☆ Measuring Entrainment in Spontaneous Code-switched Speech NAACL 2024
It is well-known that speakers who entrain to one another have more successful conversations than those who do not. Previous research has shown that interlocutors entrain on linguistic features in both written and spoken monolingual domains. More recent work on code-switched communication has also shown preliminary evidence of entrainment on certain aspects of code-switching (CSW). However, such studies of entrainment in code-switched domains have been extremely few and restricted to human-machine textual interactions. Our work studies code-switched spontaneous speech between humans, finding that (1) patterns of written and spoken entrainment in monolingual settings largely generalize to code-switched settings, and (2) some patterns of entrainment on code-switching in dialogue agent-generated text generalize to spontaneous code-switched speech. Our findings give rise to important implications for the potentially "universal" nature of entrainment as a communication phenomenon, and potential applications in inclusive and interactive speech technology.
comment: Edits: camera-ready manuscript for NAACL 2024
♻ ☆ As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli
As synthetic media becomes progressively more realistic and barriers to using it continue to lower, the technology has been increasingly utilized for malicious purposes, from financial fraud to nonconsensual pornography. Today, the principal defense against being misled by synthetic media relies on the ability of the human observer to visually and auditorily discern between real and fake. However, it remains unclear just how vulnerable people actually are to deceptive synthetic media in the course of their day to day lives. We conducted a perceptual study with 1276 participants to assess how accurate people were at distinguishing synthetic images, audio only, video only, and audiovisual stimuli from authentic. To reflect the circumstances under which people would likely encounter synthetic media in the wild, testing conditions and stimuli emulated a typical online platform, while all synthetic media used in the survey was sourced from publicly accessible generative AI technology. We find that overall, participants struggled to meaningfully discern between synthetic and authentic content. We also find that detection performance worsens when the stimuli contains synthetic content as compared to authentic content, images featuring human faces as compared to non face objects, a single modality as compared to multimodal stimuli, mixed authenticity as compared to being fully synthetic for audiovisual stimuli, and features foreign languages as compared to languages the observer is fluent in. Finally, we also find that prior knowledge of synthetic media does not meaningfully impact their detection performance. Collectively, these results indicate that people are highly susceptible to being tricked by synthetic media in their daily lives and that human perceptual detection capabilities can no longer be relied upon as an effective counterdefense.
comment: For study pre-registration, see https://osf.io/fnhr3
♻ ☆ Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives CVPR2024
We propose Lodge, a network capable of generating extremely long dance sequences conditioned on given music. We design Lodge as a two-stage coarse to fine diffusion architecture, and propose the characteristic dance primitives that possess significant expressiveness as intermediate representations between two diffusion models. The first stage is global diffusion, which focuses on comprehending the coarse-level music-dance correlation and production characteristic dance primitives. In contrast, the second-stage is the local diffusion, which parallelly generates detailed motion sequences under the guidance of the dance primitives and choreographic rules. In addition, we propose a Foot Refine Block to optimize the contact between the feet and the ground, enhancing the physical realism of the motion. Our approach can parallelly generate dance sequences of extremely long length, striking a balance between global choreographic patterns and local motion quality and expressiveness. Extensive experiments validate the efficacy of our method.
comment: Accepted by CVPR2024, Project page: https://li-ronghui.github.io/lodge
Robotics
☆ SLEDGE: Synthesizing Simulation Environments for Driving Agents with Generative Models
SLEDGE is the first generative simulator for vehicle motion planning trained on real-world driving logs. Its core component is a learned model that is able to generate agent bounding boxes and lane graphs. The model's outputs serve as an initial state for traffic simulation. The unique properties of the entities to be generated for SLEDGE, such as their connectivity and variable count per scene, render the naive application of most modern generative models to this task non-trivial. Therefore, together with a systematic study of existing lane graph representations, we introduce a novel raster-to-vector autoencoder (RVAE). It encodes agents and the lane graph into distinct channels in a rasterized latent map. This facilitates both lane-conditioned agent generation and combined generation of lanes and agents with a Diffusion Transformer. Using generated entities in SLEDGE enables greater control over the simulation, e.g. upsampling turns or increasing traffic density. Further, SLEDGE can support 500m long routes, a capability not found in existing data-driven simulators like nuPlan. It presents new challenges for planning algorithms, evidenced by failure rates of over 40% for PDM, the winner of the 2023 nuPlan challenge, when tested on hard routes and dense traffic generated by our model. Compared to nuPlan, SLEDGE requires 500$\times$ less storage to set up (<4GB), making it a more accessible option and helping with democratizing future research in this field.
☆ Multi-Agent Clarity-Aware Dynamic Coverage with Gaussian Processes
This paper presents two algorithms for multi-agent dynamic coverage in spatiotemporal environments, where the coverage algorithms are informed by the method of data assimilation. In particular, we show that by considering the information assimilation algorithm, here a Numerical Gaussian Process Kalman Filter, the influence of measurements taken at one position on the uncertainty of the estimate at another location can be computed. We use this relationship to propose new coverage algorithms. Furthermore, we show that the controllers naturally extend to the multi-agent context, allowing for a distributed-control central-information paradigm for multi-agent coverage. Finally, we demonstrate the algorithms through a realistic simulation of a team of UAVs collecting wind data over a region in Austria.
comment: 8 pages, 2 figures, submitted to CDC 2024
☆ CMP: Cooperative Motion Prediction with Multi-Agent Communication
The confluence of the advancement of Autonomous Vehicles (AVs) and the maturity of Vehicle-to-Everything (V2X) communication has enabled the capability of cooperative connected and automated vehicles (CAVs). Building on top of cooperative perception, this paper explores the feasibility and effectiveness of cooperative motion prediction. Our method, CMP, takes LiDAR signals as input to enhance tracking and prediction capabilities. Unlike previous work that focuses separately on either cooperative perception or motion prediction, our framework, to the best of our knowledge, is the first to address the unified problem where CAVs share information in both perception and prediction modules. Incorporated into our design is the unique capability to tolerate realistic V2X bandwidth limitations and transmission delays, while dealing with bulky perception representations. We also propose a prediction aggregation module, which unifies the predictions obtained by different CAVs and generates the final prediction. Through extensive experiments and ablation studies, we demonstrate the effectiveness of our method in cooperative perception, tracking, and motion prediction tasks. In particular, CMP reduces the average prediction error by 17.2\% with fewer missing detections compared with the no cooperation setting. Our work marks a significant step forward in the cooperative capabilities of CAVs, showcasing enhanced performance in complex scenarios.
☆ Multi Agent Pathfinding for Noise Restricted Hybrid Fuel Unmanned Aerial Vehicles
Multi Agent Path Finding (MAPF) seeks the optimal set of paths for multiple agents from respective start to goal locations such that no paths conflict. We address the MAPF problem for a fleet of hybrid-fuel unmanned aerial vehicles which are subject to location-dependent noise restrictions. We solve this problem by searching a constraint tree for which the subproblem at each node is a set of shortest path problems subject to the noise and fuel constraints and conflict zone avoidance. A labeling algorithm is presented to solve this subproblem, including the conflict zones which are treated as dynamic obstacles. We present the experimental results of the algorithms for various graph sizes and number of agents.
comment: 6 pages, 7 figures
☆ Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-storage environments. We provide code and trial video data at http://hovsg.github.io/.
comment: Code and video are available at http://hovsg.github.io/
☆ Scenario-Based Curriculum Generation for Multi-Agent Autonomous Driving
The automated generation of diverse and complex training scenarios has been an important ingredient in many complex learning tasks. Especially in real-world application domains, such as autonomous driving, auto-curriculum generation is considered vital for obtaining robust and general policies. However, crafting traffic scenarios with multiple, heterogeneous agents is typically considered as a tedious and time-consuming task, especially in more complex simulation environments. In our work, we introduce MATS-Gym, a Multi-Agent Traffic Scenario framework to train agents in CARLA, a high-fidelity driving simulator. MATS-Gym is a multi-agent training framework for autonomous driving that uses partial scenario specifications to generate traffic scenarios with variable numbers of agents. This paper unifies various existing approaches to traffic scenario description into a single training framework and demonstrates how it can be integrated with techniques from unsupervised environment design to automate the generation of adaptive auto-curricula. The code is available at https://github.com/AutonomousDrivingExaminer/mats-gym.
comment: 7 Pages, Under Review
☆ System Calibration of a Field Phenotyping Robot with Multiple High-Precision Profile Laser Scanners
The creation of precise and high-resolution crop point clouds in agricultural fields has become a key challenge for high-throughput phenotyping applications. This work implements a novel calibration method to calibrate the laser scanning system of an agricultural field robot consisting of two industrial-grade laser scanners used for high-precise 3D crop point cloud creation. The calibration method optimizes the transformation between the scanner origins and the robot pose by minimizing 3D point omnivariances within the point cloud. Moreover, we present a novel factor graph-based pose estimation method that fuses total station prism measurements with IMU and GNSS heading information for high-precise pose determination during calibration. The root-mean-square error of the distances to a georeferenced ground truth point cloud results in 0.8 cm after parameter optimization. Furthermore, our results show the importance of a reference point cloud in the calibration method needed to estimate the vertical translation of the calibration. Challenges arise due to non-static parameters while the robot moves, indicated by systematic deviations to a ground truth terrestrial laser scan.
☆ Optical Flow Based Detection and Tracking of Moving Objects for Autonomous Vehicles
Accurate velocity estimation of surrounding moving objects and their trajectories are critical elements of perception systems in Automated/Autonomous Vehicles (AVs) with a direct impact on their safety. These are non-trivial problems due to the diverse types and sizes of such objects and their dynamic and random behaviour. Recent point cloud based solutions often use Iterative Closest Point (ICP) techniques, which are known to have certain limitations. For example, their computational costs are high due to their iterative nature, and their estimation error often deteriorates as the relative velocities of the target objects increase (>2 m/sec). Motivated by such shortcomings, this paper first proposes a novel Detection and Tracking of Moving Objects (DATMO) for AVs based on an optical flow technique, which is proven to be computationally efficient and highly accurate for such problems. \textcolor{black}{This is achieved by representing the driving scenario as a vector field and applying vector calculus theories to ensure spatiotemporal continuity.} We also report the results of a comprehensive performance evaluation of the proposed DATMO technique, carried out in this study using synthetic and real-world data. The results of this study demonstrate the superiority of the proposed technique, compared to the DATMO techniques in the literature, in terms of estimation accuracy and processing time in a wide range of relative velocities of moving objects. Finally, we evaluate and discuss the sensitivity of the estimation error of the proposed DATMO technique to various system and environmental parameters, as well as the relative velocities of the moving objects.
comment: This manuscript has been accepted as a regular paper in Transactions on Intelligent Transportation Systems (DOI: 10.1109/TITS.2024.3382495)
☆ LiDAR-Based Crop Row Detection Algorithm for Over-Canopy Autonomous Navigation in Agriculture Fields IROS 2024
Autonomous navigation is crucial for various robotics applications in agriculture. However, many existing methods depend on RTK-GPS systems, which are expensive and susceptible to poor signal coverage. This paper introduces a state-of-the-art LiDAR-based navigation system that can achieve over-canopy autonomous navigation in row-crop fields, even when the canopy fully blocks the interrow spacing. Our crop row detection algorithm can detect crop rows across diverse scenarios, encompassing various crop types, growth stages, weed presence, and discontinuities within the crop rows. Without utilizing the global localization of the robot, our navigation system can perform autonomous navigation in these challenging scenarios, detect the end of the crop rows, and navigate to the next crop row autonomously, providing a crop-agnostic approach to navigate the whole row-crop field. This navigation system has undergone tests in various simulated agricultural fields, achieving an average of $2.98cm$ autonomous driving accuracy without human intervention on the custom Amiga robot. In addition, the qualitative results of our crop row detection algorithm from the actual soybean fields validate our LiDAR-based crop row detection algorithm's potential for practical agricultural applications.
comment: 7 pages, 9 figures, submitted to IROS 2024
☆ Learning Goal-Directed Object Pushing in Cluttered Scenes with Location-Based Attention IROS
Non-prehensile planar pushing is a challenging task due to its underactuated nature with hybrid-dynamics, where a robot needs to reason about an object's long-term behaviour and contact-switching, while being robust to contact uncertainty. The presence of clutter in the environment further complicates this task, introducing the need to include more sophisticated spatial analysis to avoid collisions. Building upon prior work on reinforcement learning (RL) with multimodal categorical exploration for planar pushing, in this paper we incorporate location-based attention to enable robust navigation through clutter. Unlike previous RL literature addressing this obstacle avoidance pushing task, our framework requires no predefined global paths and considers the target orientation of the manipulated object. Our results demonstrate that the learned policies successfully navigate through a wide range of complex obstacle configurations, including dynamic obstacles, with smooth motions, achieving the desired target object pose. We also validate the transferability of the learned policies to robotic hardware using the KUKA iiwa robot arm.
comment: Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
☆ UADA3D: Unsupervised Adversarial Domain Adaptation for 3D Object Detection with Sparse LiDAR and Large Domain Gaps
In this study, we address a gap in existing unsupervised domain adaptation approaches on LiDAR-based 3D object detection, which have predominantly concentrated on adapting between established, high-density autonomous driving datasets. We focus on sparser point clouds, capturing scenarios from different perspectives: not just from vehicles on the road but also from mobile robots on sidewalks, which encounter significantly different environmental conditions and sensor configurations. We introduce Unsupervised Adversarial Domain Adaptation for 3D Object Detection (UADA3D). UADA3D does not depend on pre-trained source models or teacher-student architectures. Instead, it uses an adversarial approach to directly learn domain-invariant features. We demonstrate its efficacy in various adaptation scenarios, showing significant improvements in both self-driving car and mobile robot domains. Our code is open-source and will be available soon.
☆ Online Tree Reconstruction and Forest Inventory on a Mobile Robotic System
Terrestrial laser scanning (TLS) is the standard technique used to create accurate point clouds for digital forest inventories. However, the measurement process is demanding, requiring up to two days per hectare for data collection, significant data storage, as well as resource-heavy post-processing of 3D data. In this work, we present a real-time mapping and analysis system that enables online generation of forest inventories using mobile laser scanners that can be mounted e.g. on mobile robots. Given incrementally created and locally accurate submaps-data payloads-our approach extracts tree candidates using a custom, Voronoi-inspired clustering algorithm. Tree candidates are reconstructed using an adapted Hough algorithm, which enables robust modeling of the tree stem. Further, we explicitly incorporate the incremental nature of the data collection by consistently updating the database using a pose graph LiDAR SLAM system. This enables us to refine our estimates of the tree traits if an area is revisited later during a mission. We demonstrate competitive accuracy to TLS or manual measurements using laser scanners that we mounted on backpacks or mobile robots operating in conifer, broad-leaf and mixed forests. Our results achieve RMSE of 1.93 cm, a bias of 0.65 cm and a standard deviation of 1.81 cm (averaged across these sequences)-with no post-processing required after the mission is complete.
☆ Interactive Identification of Granular Materials using Force Measurements IROS 2024
The ability to identify granular materials facilitates the emergence of various new applications in robotics, ranging from cooking at home to truck loading at mining sites. However, granular material identification remains a challenging and underexplored area. In this work, we present a novel interactive material identification framework that enables robots to identify a wide range of granular materials using only a force-torque sensor for perception. Our framework, comprising interactive exploration, feature extraction, and classification stages, prioritizes simplicity and transparency for seamless integration into various manipulation pipelines. We evaluate the proposed approach through extensive experiments with a real-world dataset comprising 11 granular materials, which we also make publicly available. Additionally, we conducted a comprehensive qualitative analysis of the dataset to offer deeper insights into its nature, aiding future development. Our results show that the proposed method is capable of accurately identifying a wide range of granular materials solely relying on force measurements obtained from direct interaction with the materials. Code and dataset are available at: https://irobotics.aalto.fi/indentify_granular/.
comment: Submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
☆ Aerial Robots Carrying Flexible Cables: Dynamic Shape Optimal Control via Spectral Method Model
In this work, we present a model-based optimal boundary control design for an aerial robotic system composed of a quadrotor carrying a flexible cable. The whole system is modeled by partial differential equations (PDEs) combined with boundary conditions described by ordinary differential equations (ODEs). The proper orthogonal decomposition (POD) method is adopted to project the original infinite-dimensional system on a subspace spanned by orthogonal basis functions. Based on the reduced order model, nonlinear model predictive control (NMPC) is implemented online to realize shape trajectory tracking of the flexible cable in an optimal predictive fashion. The proposed reduced modeling and optimal control paradigms are numerically verified against an accurate high-dimensional FDM-based model in different scenarios and the controller's superior performance is shown compared to an optimally tuned PID controller.
☆ Time-Optimal Flight with Safety Constraints and Data-driven Dynamics
Time-optimal quadrotor flight is an extremely challenging problem due to the limited control authority encountered at the limit of handling. Model Predictive Contouring Control (MPCC) has emerged as a leading model-based approach for time optimization problems such as drone racing. However, the standard MPCC formulation used in quadrotor racing introduces the notion of the gates directly in the cost function, creating a multi-objective optimization that continuously trades off between maximizing progress and tracking the path accurately. This paper introduces three key components that enhance the MPCC approach for drone racing. First and foremost, we provide safety guarantees in the form of a constraint and terminal set. The safety set is designed as a spatial constraint which prevents gate collisions while allowing for time-optimization only in the cost function. Second, we augment the existing first principles dynamics with a residual term that captures complex aerodynamic effects and thrust forces learned directly from real world data. Third, we use Trust Region Bayesian Optimization (TuRBO), a state of the art global Bayesian Optimization algorithm, to tune the hyperparameters of the MPC controller given a sparse reward based on lap time minimization. The proposed approach achieves similar lap times to the best state-of-the-art RL and outperforms the best time-optimal controller while satisfying constraints. In both simulation and real-world, our approach consistently prevents gate crashes with 100\% success rate, while pushing the quadrotor to its physical limit reaching speeds of more than 80km/h.
comment: 12 pages, 7 figures
☆ DeepMIF: Deep Monotonic Implicit Fields for Large-Scale LiDAR 3D Mapping
Recently, significant progress has been achieved in sensing real large-scale outdoor 3D environments, particularly by using modern acquisition equipment such as LiDAR sensors. Unfortunately, they are fundamentally limited in their ability to produce dense, complete 3D scenes. To address this issue, recent learning-based methods integrate neural implicit representations and optimizable feature grids to approximate surfaces of 3D scenes. However, naively fitting samples along raw LiDAR rays leads to noisy 3D mapping results due to the nature of sparse, conflicting LiDAR measurements. Instead, in this work we depart from fitting LiDAR data exactly, instead letting the network optimize a non-metric monotonic implicit field defined in 3D space. To fit our field, we design a learning system integrating a monotonicity loss that enables optimizing neural monotonic fields and leverages recent progress in large-scale 3D mapping. Our algorithm achieves high-quality dense 3D mapping performance as captured by multiple quantitative and perceptual measures and visual results obtained for Mai City, Newer College, and KITTI benchmarks. The code of our approach will be made publicly available.
comment: 8 pages, 6 figures
☆ Design and Preliminary Evaluation of a Torso Stabiliser for Individuals with Spinal Cord Injury
Spinal cord injuries (SCIs) generally result in sensory and mobility impairments, with torso instability being particularly debilitating. Existing torso stabilisers are often rigid and restrictive. This paper presents an early investigation into a non-restrictive 1 degree-of-freedom (DoF) mechanical torso stabiliser inspired by devices such as centrifugal clutches and seat-belt mechanisms. Firstly, the paper presents a motion-capture (MoCap) and OpenSim-based kinematic analysis of the cable-based system to understand requisite device characteristics. The simulated evaluation resulted in the cable-based device to require 55-60cm of unrestricted travel, and to lock at a threshold cable velocity of 80-100cm/sec. Next, the developed 1-DoF device is introduced. The proposed mechanical device is transparent during activities of daily living, and transitions to compliant blocking when incipient fall is detected. Prototype behaviour was then validated using a MoCap-based kinematic analysis to verify non-restrictive movement, reliable transition to blocking, and compliance of the blocking.
comment: 4 pages, 4 figures, 10 references. Submitted to IEEE EMBC 2024 conference
☆ High-Power, Flexible, Robust Hand: Development of Musculoskeletal Hand Using Machined Springs and Realization of Self-Weight Supporting Motion with Humanoid IROS2017
Human can not only support their body during standing or walking, but also support them by hand, so that they can dangle a bar and others. But most humanoid robots support their body only in the foot and they use their hand just to manipulate objects because their hands are too weak to support their body. Strong hands are supposed to enable humanoid robots to act in much broader scene. Therefore, we developed new life-size five-fingered hand that can support the body of life-size humanoid robot. It is tendon-driven and underactuated hand and actuators in forearms produce large gripping force. This hand has flexible joints using machined springs, which can be designed integrally with the attachment. Thus, it has both structural strength and impact resistance in spite of small size. As other characteristics, this hand has force sensors to measure external force and the fingers can be flexed along objects though the number of actuators to flex fingers is less than that of fingers. We installed the developed hand on musculoskeletal humanoid "Kengoro" and achieved two self-weight supporting motions: push-up motion and dangling motion.
comment: accepted at IROS2017
☆ Five-fingered Hand with Wide Range of Thumb Using Combination of Machined Springs and Variable Stiffness Joints IROS2018
Human hands can not only grasp objects of various shape and size and manipulate them in hands but also exert such a large gripping force that they can support the body in the situations such as dangling a bar and climbing a ladder. On the other hand, it is difficult for most robot hands to manage both. Therefore in this paper we developed the hand which can grasp various objects and exert large gripping force. To develop such hand, we focused on the thumb CM joint with wide range of motion and the MP joints of four fingers with the DOF of abduction and adduction. Based on the hand with large gripping force and flexibility using machined spring, we applied above mentioned joint mechanism to the hand. The thumb CM joint has wide range of motion because of the combination of three machined springs and MP joints of four fingers have variable rigidity mechanism instead of driving each joint independently in order to move joint in limited space and by limited actuators. Using the developed hand, we achieved the grasping of various objects, supporting a large load and several motions with an arm.
comment: accepted at IROS2018
☆ Adaptive Line-Of-Sight guidance law based on vector fields path following for underactuated unmanned surface vehicle
The focus of this paper is to develop a methodology that enables an unmanned surface vehicle (USV) to efficiently track a planned path. The introduction of a vector field-based adaptive line-of-sight guidance law (VFALOS) for accurate trajectory tracking and minimizing the overshoot response time during USV tracking of curved paths improves the overall line-of-sight (LOS) guidance method. These improvements contribute to faster convergence to the desired path, reduce oscillations, and can mitigate the effects of persistent external disturbances. It is shown that the proposed guidance law exhibits k-exponential stability when converging to the desired path consisting of straight and curved lines. The results in the paper show that the proposed method effectively improves the accuracy of the USV tracking the desired path while ensuring the safety of the USV work.
☆ Adaptive LiDAR-Radar Fusion for Outdoor Odometry Across Dense Smoke Conditions
Robust odometry estimation in perceptually degraded environments represents a key challenge in the field of robotics. In this paper, we propose a LiDAR-radar fusion method for robust odometry for adverse environment with LiDAR degeneracy. By comparing the LiDAR point cloud with the radar static point cloud obtained through preprocessing module, it is possible to identify instances of LiDAR degeneracy to overcome perceptual limits. We demonstrate the effectiveness of our method in challenging conditions such as dense smoke, showcasing its ability to reliably estimate odometry and identify/remove dynamic points prone to LiDAR degeneracy.
☆ Cyclic pursuit formation control for arbitrary desired shapes
A multi-agent system comprises numerous agents that autonomously make decisions to collectively accomplish tasks, drawing significant attention for their wide-ranging applications. Within this context, formation control emerges as a prominent task, wherein agents collaboratively shape and maneuver while preserving formation integrity. Our focus centers on cyclic pursuit, a method facilitating the formation of circles, ellipses, and figure-eights under the assumption that agents can only perceive the relative positions of those preceding them. However, this method's scope has been restricted to these specific shapes, leaving the feasibility of forming other shapes uncertain. In response, our study proposes a novel method based on cyclic pursuit capable of forming a broader array of shapes, enabling agents to individually shape while pursuing preceding agents, thereby extending the repertoire of achievable formations. We present two scenarios concerning the information available to agents and devise formation control methods tailored to each scenario. Through extensive simulations, we demonstrate the efficacy of our proposed method in forming multiple shapes, including those represented as Fourier series, thereby underscoring the versatility and effectiveness of our approach.
☆ Natural-artificial hybrid swarm: Cyborg-insect group navigation in unknown obstructed soft terrain
Navigating multi-robot systems in complex terrains has always been a challenging task. This is due to the inherent limitations of traditional robots in collision avoidance, adaptation to unknown environments, and sustained energy efficiency. In order to overcome these limitations, this research proposes a solution by integrating living insects with miniature electronic controllers to enable robotic-like programmable control, and proposing a novel control algorithm for swarming. Although these creatures, called cyborg insects, have the ability to instinctively avoid collisions with neighbors and obstacles while adapting to complex terrains, there is a lack of literature on the control of multi-cyborg systems. This research gap is due to the difficulty in coordinating the movements of a cyborg system under the presence of insects' inherent individual variability in their reactions to control input. In response to this issue, we propose a novel swarm navigation algorithm addressing these challenges. The effectiveness of the algorithm is demonstrated through an experimental validation in which a cyborg swarm was successfully navigated through an unknown sandy field with obstacles and hills. This research contributes to the domain of swarm robotics and showcases the potential of integrating biological organisms with robotics and control theory to create more intelligent autonomous systems with real-world applications.
☆ RoboDuet: A Framework Affording Mobile-Manipulation and Cross-Embodiment
Combining the mobility of legged robots with the manipulation skills of arms has the potential to significantly expand the operational range and enhance the capabilities of robotic systems in performing various mobile manipulation tasks. Existing approaches are confined to imprecise six degrees of freedom (DoF) manipulation and possess a limited arm workspace. In this paper, we propose a novel framework, RoboDuet, which employs two collaborative policies to realize locomotion and manipulation simultaneously, achieving whole-body control through interactions between each other. Surprisingly, going beyond the large-range pose tracking, we find that the two-policy framework may enable cross-embodiment deployment such as using different quadrupedal robots or other arms. Our experiments demonstrate that the policies trained through RoboDuet can accomplish stable gaits, agile 6D end-effector pose tracking, and zero-shot exchange of legged robots, and can be deployed in the real world to perform various mobile manipulation tasks. Our project page with demo videos is at https://locomanip-duet.github.io .
☆ Multi-Objective Trajectory Planning with Dual-Encoder
Time-jerk optimal trajectory planning is crucial in advancing robotic arms' performance in dynamic tasks. Traditional methods rely on solving complex nonlinear programming problems, bringing significant delays in generating optimized trajectories. In this paper, we propose a two-stage approach to accelerate time-jerk optimal trajectory planning. Firstly, we introduce a dual-encoder based transformer model to establish a good preliminary trajectory. This trajectory is subsequently refined through sequential quadratic programming to improve its optimality and robustness. Our approach outperforms the state-of-the-art by up to 79.72\% in reducing trajectory planning time. Compared with existing methods, our method shrinks the optimality gap with the objective function value decreasing by up to 29.9\%.
comment: 6 pages, 7 figures, conference
☆ Unified Path and Gait Planning for Safe Bipedal Robot Navigation
Safe path and gait planning are essential for bipedal robots to navigate complex real-world environments. The prevailing approaches often plan the path and gait separately in a hierarchical fashion, potentially resulting in unsafe movements due to neglecting the physical constraints of walking robots. A safety-critical path must not only avoid obstacles but also ensure that the robot's gaits are subject to its dynamic and kinematic constraints. This work presents a novel approach that unifies path planning and gait planning via a Model Predictive Control (MPC) using the Linear Inverted Pendulum (LIP) model representing bipedal locomotion. This approach considers environmental constraints, such as obstacles, and the robot's kinematics and dynamics constraints. By using discrete-time Control Barrier Functions for obstacle avoidance, our approach generates the next foot landing position, ensuring robust walking gaits and a safe navigation path within clustered environments. We validated our proposed approach in simulation using a Digit robot in 20 randomly created environments. The results demonstrate improved performance in terms of safety and robustness when compared to hierarchical path and gait planning frameworks.
comment: 8 pages
☆ Leveraging Symmetry in RL-based Legged Locomotion Control
Model-free reinforcement learning is a promising approach for autonomously solving challenging robotics control problems, but faces exploration difficulty without information of the robot's kinematics and dynamics morphology. The under-exploration of multiple modalities with symmetric states leads to behaviors that are often unnatural and sub-optimal. This issue becomes particularly pronounced in the context of robotic systems with morphological symmetries, such as legged robots for which the resulting asymmetric and aperiodic behaviors compromise performance, robustness, and transferability to real hardware. To mitigate this challenge, we can leverage symmetry to guide and improve the exploration in policy learning via equivariance/invariance constraints. In this paper, we investigate the efficacy of two approaches to incorporate symmetry: modifying the network architectures to be strictly equivariant/invariant, and leveraging data augmentation to approximate equivariant/invariant actor-critics. We implement the methods on challenging loco-manipulation and bipedal locomotion tasks and compare with an unconstrained baseline. We find that the strictly equivariant policy consistently outperforms other methods in sample efficiency and task performance in simulation. In addition, symmetry-incorporated approaches exhibit better gait quality, higher robustness and can be deployed zero-shot in real-world experiments.
☆ Sparse-Graph-Enabled Formation Planning for Large-Scale Aerial Swarms
The formation trajectory planning using complete graphs to model collaborative constraints becomes computationally intractable as the number of drones increases due to the curse of dimensionality. To tackle this issue, this paper presents a sparse graph construction method for formation planning to realize better efficiency-performance trade-off. Firstly, a sparsification mechanism for complete graphs is designed to ensure the global rigidity of sparsified graphs, which is a necessary condition for uniquely corresponding to a geometric shape. Secondly, a good sparse graph is constructed to preserve the main structural feature of complete graphs sufficiently. Since the graph-based formation constraint is described by Laplacian matrix, the sparse graph construction problem is equivalent to submatrix selection, which has combinatorial time complexity and needs a scoring metric. Via comparative simulations, the Max-Trace matrix-revealing metric shows the promising performance. The sparse graph is integrated into the formation planning. Simulation results with 72 drones in complex environments demonstrate that when preserving 30\% connection edges, our method has comparative formation error and recovery performance w.r.t. complete graphs. Meanwhile, the planning efficiency is improved by approximate an order of magnitude. Benchmark comparisons and ablation studies are conducted to fully validate the merits of our method.
♻ ☆ Safe Explicable Planning
Human expectations arise from their understanding of others and the world. In the context of human-AI interaction, this understanding may not align with reality, leading to the AI agent failing to meet expectations and compromising team performance. Explicable planning, introduced as a method to bridge this gap, aims to reconcile human expectations with the agent's optimal behavior, facilitating interpretable decision-making. However, an unresolved critical issue is ensuring safety in explicable planning, as it could result in explicable behaviors that are unsafe. To address this, we propose Safe Explicable Planning (SEP), which extends the prior work to support the specification of a safety bound. The goal of SEP is to find behaviors that align with human expectations while adhering to the specified safety criterion. Our approach generalizes the consideration of multiple objectives stemming from multiple models rather than a single model, yielding a Pareto set of safe explicable policies. We present both an exact method, guaranteeing finding the Pareto set, and a more efficient greedy method that finds one of the policies in the Pareto set. Additionally, we offer approximate solutions based on state aggregation to improve scalability. We provide formal proofs that validate the desired theoretical properties of these methods. Evaluation through simulations and physical robot experiments confirms the effectiveness of our approach for safe explicable planning.
♻ ☆ Resilient source seeking with robot swarms
We present a solution for locating the source, or maximum, of an unknown scalar field using a swarm of mobile robots. Unlike relying on the traditional gradient information, the swarm determines an ascending direction to approach the source with arbitrary precision. The ascending direction is calculated from measurements of the field strength at the robot locations and their relative positions concerning the centroid. Rather than focusing on individual robots, we focus the analysis on the density of robots per unit area to guarantee a more resilient swarm, i.e., the functionality remains even if individuals go missing or are misplaced during the mission. We reinforce the robustness of the algorithm by providing sufficient conditions for the swarm shape so that the ascending direction is almost parallel to the gradient. The swarm can respond to an unexpected environment by morphing its shape and exploiting the existence of multiple ascending directions. Finally, we validate our approach numerically with hundreds of robots. The fact that a large number of robots always calculate an ascending direction compensates for the loss of individuals and mitigates issues arising from the actuator and sensor noises.
comment: 7 pages, submitted to CDC 2024
♻ ☆ Tuning-free Quasi-stiffness Control Framework of a Powered Transfemoral Prosthesis for Task-adaptive Walking
Impedance-based control represents a prevalent strategy in the development of powered transfemoral prostheses. However, creating a task-adaptive, tuning-free controller that effectively generalizes across diverse locomotion modes and terrain conditions continues to be a significant challenge. This letter proposes a tuning-free and task-adaptive quasi-stiffness control framework for powered prostheses that generalizes across various walking tasks, including the torque-angle relationship reconstruction part and the quasi-stiffness controller design part. A Gaussian Process Regression (GPR) model is introduced to predict the target features of the human joint angle and torque in a new task. Subsequently, a Kernelized Movement Primitives (KMP) is employed to reconstruct the torque-angle relationship of the new task from multiple human reference trajectories and estimated target features. Based on the torque-angle relationship of the new task, a quasi-stiffness control approach is designed for a powered prosthesis. Finally, the proposed framework is validated through practical examples, including varying speeds and inclines walking tasks. Notably, the proposed framework not only aligns with but frequently surpasses the performance of a benchmark finite state machine impedance controller (FSMIC) without necessitating manual impedance tuning and has the potential to expand to variable walking tasks in daily life for the transfemoral amputees.
comment: 8 pages, 10 figures. This work has been submitted to the IEEE-RAL for possible publication
♻ ☆ Domain Randomization via Entropy Maximization ICLR 2024
Varying dynamics parameters in simulation is a popular Domain Randomization (DR) approach for overcoming the reality gap in Reinforcement Learning (RL). Nevertheless, DR heavily hinges on the choice of the sampling distribution of the dynamics parameters, since high variability is crucial to regularize the agent's behavior but notoriously leads to overly conservative policies when randomizing excessively. In this paper, we propose a novel approach to address sim-to-real transfer, which automatically shapes dynamics distributions during training in simulation without requiring real-world data. We introduce DOmain RAndomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution while retaining generalization capabilities. In achieving this, DORAEMON gradually increases the diversity of sampled dynamics parameters as long as the probability of success of the current policy is sufficiently high. We empirically validate the consistent benefits of DORAEMON in obtaining highly adaptive and generalizable policies, i.e. solving the task at hand across the widest range of dynamics parameters, as opposed to representative baselines from the DR literature. Notably, we also demonstrate the Sim2Real applicability of DORAEMON through its successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters.
comment: Published as a conference paper at ICLR 2024. Project website at https://gabrieletiboni.github.io/doraemon/
♻ ☆ SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. It incorporates appearance, geometry, and semantic features through multi-channel optimization, addressing the oversmoothing limitations of neural implicit SLAM systems in high-quality rendering, scene understanding, and object-level geometry. We introduce a unique semantic feature loss that effectively compensates for the shortcomings of traditional depth and color losses in object optimization. Through a semantic-guided keyframe selection strategy, we prevent erroneous reconstructions caused by cumulative errors. Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, precise semantic segmentation, and object-level geometric accuracy, while ensuring real-time rendering capabilities.
♻ ☆ Towards Source-free Domain Adaptive Semantic Segmentation via Importance-aware and Prototype-contrast Learning
Domain adaptive semantic segmentation enables robust pixel-wise understanding in real-world driving scenes. Source-free domain adaptation, as a more practical technique, addresses the concerns of data privacy and storage limitations in typical unsupervised domain adaptation methods, making it especially relevant in the context of intelligent vehicles. It utilizes a well-trained source model and unlabeled target data to achieve adaptation in the target domain. However, in the absence of source data and target labels, current solutions cannot sufficiently reduce the impact of domain shift and fully leverage the information from the target data. In this paper, we propose an end-to-end source-free domain adaptation semantic segmentation method via Importance-Aware and Prototype-Contrast (IAPC) learning. The proposed IAPC framework effectively extracts domain-invariant knowledge from the well-trained source model and learns domain-specific knowledge from the unlabeled target domain. Specifically, considering the problem of domain shift in the prediction of the target domain by the source model, we put forward an importance-aware mechanism for the biased target prediction probability distribution to extract domain-invariant knowledge from the source model. We further introduce a prototype-contrast strategy, which includes a prototype-symmetric cross-entropy loss and a prototype-enhanced cross-entropy loss, to learn target intra-domain knowledge without relying on labels. A comprehensive variety of experiments on two domain adaptive semantic segmentation benchmarks demonstrates that the proposed end-to-end IAPC solution outperforms existing state-of-the-art methods. The source code is publicly available at https://github.com/yihong-97/Source-free-IAPC.
comment: Accepted to IEEE Transactions on Intelligent Vehicles (T-IV). The source code is publicly available at https://github.com/yihong-97/Source-free-IAPC
♻ ☆ When Robotics Meets Wireless Communications: An Introductory Tutorial
The importance of ground Mobile Robots (MRs) and Unmanned Aerial Vehicles (UAVs) within the research community, industry, and society is growing fast. Many of these agents are nowadays equipped with communication systems that are, in some cases, essential to successfully achieve certain tasks. In this context, we have begun to witness the development of a new interdisciplinary research field at the intersection of robotics and communications. This research field has been boosted by the intention of integrating UAVs within the 5G and 6G communication networks. This research will undoubtedly lead to many important applications in the near future. Nevertheless, one of the main obstacles to the development of this research area is that most researchers address these problems by oversimplifying either the robotics or the communications aspect. This impedes the ability of reaching the full potential of this new interdisciplinary research area. In this tutorial, we present some of the modelling tools necessary to address problems involving both robotics and communication from an interdisciplinary perspective. As an illustrative example of such problems, we focus in this tutorial on the issue of communication-aware trajectory planning.
comment: 35 pages, 192 references
♻ ☆ Motion Generation from Fine-grained Textual Descriptions
The task of text2motion is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., "A man squats.", fine-grained descriptions specifying movements of relevant body parts are barely explored. Models trained with coarse-grained texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure to generate motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with step-by-step instructions with pseudo-code compulsory checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions, by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at https://github.com/KunhangL/finemotiondiffuse.
♻ ☆ Attention-based Estimation and Prediction of Human Intent to augment Haptic Glove aided Control of Robotic Hand
The letter focuses on Haptic Glove (HG) based control of a Robotic Hand (RH) executing in-hand manipulation of certain objects of interest. The high dimensional motion signals in HG and RH possess intrinsic variability of kinematics resulting in difficulty to establish a direct mapping of the motion signals from HG onto the RH. An estimation mechanism is proposed to quantify the motion signal acquired from the human controller in relation to the intended goal pose of the object being held by the robotic hand. A control algorithm is presented to transform the synthesized intent at the RH and allow relocation of the object to the expected goal pose. The lag in synthesis of the intent in the presence of communication delay leads to a requirement of predicting the estimated intent. We leverage an attention-based convolutional neural network encoder to predict the trajectory of intent for a certain lookahead to compensate for the delays. The proposed methodology is evaluated across objects of different shapes, mass, and materials. We present a comparative performance of the estimation and prediction mechanisms on 5G-driven real-world robotic setup against benchmark methodologies. The test-MSE in prediction of human intent is reported to yield ~ 97.3 -98.7% improvement of accuracy in comparison to LSTM-based benchmark
♻ ☆ Guessing human intentions to avoid dangerous situations in caregiving robots IROS2024
For robots to interact socially, they must interpret human intentions and anticipate their potential outcomes accurately. This is particularly important for social robots designed for human care, which may face potentially dangerous situations for people, such as unseen obstacles in their way, that should be avoided. This paper explores the Artificial Theory of Mind (ATM) approach to inferring and interpreting human intentions. We propose an algorithm that detects risky situations for humans, selecting a robot action that removes the danger in real time. We use the simulation-based approach to ATM and adopt the 'like-me' policy to assign intentions and actions to people. Using this strategy, the robot can detect and act with a high rate of success under time-constrained situations. The algorithm has been implemented as part of an existing robotics cognitive architecture and tested in simulation scenarios. Three experiments have been conducted to test the implementation's robustness, precision and real-time response, including a simulated scenario, a human-in-the-loop hybrid configuration and a real-world scenario.
comment: 8 pages, 6 figures. Submitted to IROS2024. For associated mpeg file see https://youtu.be/87UEB8P97KY
♻ ☆ Full Attitude Intelligent Controller Design of a Heliquad under Complete Failure of an Actuator
In this paper, we design a reliable Heliquad and develop an intelligent controller to handle one actuators complete failure. Heliquad is a multi-copter similar to Quadcopter, with four actuators diagonally symmetric from the center. Each actuator has two control inputs; the first input changes the propeller blades collective pitch (also called variable pitch), and the other input changes the rotation speed. For reliable operation and high torque characteristic requirement for yaw control, a cambered airfoil is used to design propeller blades. A neural network-based control allocation is designed to provide complete control authority even under a complete loss of one actuator. Nonlinear quaternion based outer loop position control, with proportional-derivative inner loop for attitude control and neural network-based control allocation is used in controller design. The proposed controller and Heliquad designs performance is evaluated using a software-in-loop simulation to track the position reference command under failure. The results clearly indicate that the Heliquad with an intelligent controller provides necessary tracking performance even under a complete loss of one actuator.
comment: 7 pages, For video go to https://indianinstituteofscience-my.sharepoint.com/:v:/g/personal/eeshank_iisc_ac_in/EcMg2uTtE91AsHDejNkb6YMBNckaXGjeh_YMzDV6sAHZAQ?e=DrRqmN
♻ ☆ Towards Massive Interaction with Generalist Robotics: A Systematic Review of XR-enabled Remote Human-Robot Interaction Systems
The rising interest of generalist robots seek to create robots with versatility to handle multiple tasks in a variety of environments, and human will interact with such robots through immersive interfaces. In the context of human-robot interaction (HRI), this survey provides an exhaustive review of the applications of extended reality (XR) technologies in the field of remote HRI. We developed a systematic search strategy based on the PRISMA methodology. From the initial 2,561 articles selected, 100 research papers that met our inclusion criteria were included. We categorized and summarized the domain in detail, delving into XR technologies, including augmented reality (AR), virtual reality (VR), and mixed reality (MR), and their applications in facilitating intuitive and effective remote control and interaction with robotic systems. The survey highlights existing articles on the application of XR technologies, user experience enhancement, and various interaction designs for XR in remote HRI, providing insights into current trends and future directions. We also identified potential gaps and opportunities for future research to improve remote HRI systems through XR technology to guide and inform future XR and robotics research.
♻ ☆ Autonomous Hook-Based Grasping and Transportation with Quadcopters
Payload grasping and transportation with quadcopters is an active research area that has rapidly developed over the last decade. To grasp a payload without human interaction, most state-of-the-art approaches apply robotic arms that are attached to the quadcopter body. However, due to the large weight and power consumption of these aerial manipulators, their agility and flight time are limited. This paper proposes a motion control and planning method for transportation with a lightweight, passive manipulator structure that consists of a hook attached to a quadrotor using a 1 DoF revolute joint. To perform payload grasping, transportation, and release, first, time-optimal reference trajectories are designed through specific waypoints to ensure the fast and reliable execution of the tasks. Then, a two-stage motion control approach is developed based on a robust geometric controller for precise and reliable reference tracking and a linear--quadratic payload regulator for rapid setpoint stabilization of the payload swing. Furthermore, stability of the closed-loop system is mathematically proven to give safety guarantee for its operation. The proposed control architecture and design are evaluated in a high-fidelity physical simulator, and also in real flight experiments, using a custom-made quadrotor--hook manipulator platform.
♻ ☆ Robustness Evaluation of Localization Techniques for Autonomous Racing
This work introduces SynPF, an MCL-based algorithm tailored for high-speed racing environments. Benchmarked against Cartographer, a state-of-the-art pose-graph SLAM algorithm, SynPF leverages synergies from previous particle-filtering methods and synthesizes them for the high-performance racing domain. Our extensive in-field evaluations reveal that while Cartographer excels under nominal conditions, it struggles when subjected to wheel-slip, a common phenomenon in a racing scenario due to varying grip levels and aggressive driving behaviour. Conversely, SynPF demonstrates robustness in these challenging conditions and a low-latency computation time of 1.25 ms on on-board computers without a GPU. Using the F1TENTH platform, a 1:10 scaled autonomous racing vehicle, this work not only highlights the vulnerabilities of existing algorithms in high-speed scenarios, tested up until 7.6 m/s, but also emphasizes the potential of SynPF as a viable alternative, especially in deteriorating odometry conditions.
comment: Accepted at the Design, Automation and Test in Europe Conference 2024 as an extended abstract
♻ ☆ OCC-VO: Dense Mapping via 3D Occupancy-Based Visual Odometry for Autonomous Driving
Visual Odometry (VO) plays a pivotal role in autonomous systems, with a principal challenge being the lack of depth information in camera images. This paper introduces OCC-VO, a novel framework that capitalizes on recent advances in deep learning to transform 2D camera images into 3D semantic occupancy, thereby circumventing the traditional need for concurrent estimation of ego poses and landmark locations. Within this framework, we utilize the TPV-Former to convert surround view cameras' images into 3D semantic occupancy. Addressing the challenges presented by this transformation, we have specifically tailored a pose estimation and mapping algorithm that incorporates Semantic Label Filter, Dynamic Object Filter, and finally, utilizes Voxel PFilter for maintaining a consistent global semantic map. Evaluations on the Occ3D-nuScenes not only showcase a 20.6% improvement in Success Ratio and a 29.6% enhancement in trajectory accuracy against ORB-SLAM3, but also emphasize our ability to construct a comprehensive map. Our implementation is open-sourced and available at: https://github.com/USTCLH/OCC-VO.
comment: 7pages, 3 figures
♻ ☆ Motion Planning Diffusion: Learning and Planning of Robot Motions with Diffusion Models
Learning priors on trajectory distributions can help accelerate robot motion planning optimization. Given previously successful plans, learning trajectory generative models as priors for a new planning problem is highly desirable. Prior works propose several ways on utilizing this prior to bootstrapping the motion planning problem. Either sampling the prior for initializations or using the prior distribution in a maximum-a-posterior formulation for trajectory optimization. In this work, we propose learning diffusion models as priors. We then can sample directly from the posterior trajectory distribution conditioned on task goals, by leveraging the inverse denoising process of diffusion models. Furthermore, diffusion has been recently shown to effectively encode data multimodality in high-dimensional settings, which is particularly well-suited for large trajectory dataset. To demonstrate our method efficacy, we compare our proposed method - Motion Planning Diffusion - against several baselines in simulated planar robot and 7-dof robot arm manipulator environments. To assess the generalization capabilities of our method, we test it in environments with previously unseen obstacles. Our experiments show that diffusion models are strong priors to encode high-dimensional trajectory distributions of robot motions.
♻ ☆ Feeling Optimistic? Ambiguity Attitudes for Online Decision Making
Due to the complexity of many decision making problems, tree search algorithms often have inadequate information to produce accurate transition models. Robust methods, designed to make safe decisions when faced with these uncertainties, often overlook the impact expressions of uncertainty have on how the decision is made. This work introduces the Ambiguity Attitude Graph Search (AAGS), advocating for more precise representation of ambiguities (uncertainty from a set of plausible models) in decision making. Additionally, AAGS allows users to adjust their ambiguity attitude (or preference), promoting exploration and improving users' ability to control how an agent should respond when faced with a set of valid alternatives. Simulation in a dynamic sailing environment shows how highly stochastic environments can lead robust methods to fail. Results further demonstrate how adjusting ambiguity attitudes better fulfills objectives while mitigating this failure mode of robust approaches. Because this approach is a generalization of the robust framework, these results further demonstrate how algorithms focused on ambiguity have applicability beyond safety-critical systems.
comment: 6 pages, 5 figures, 2 algorithms. Submitted to the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems in Abu Dhabi, UAE (Oct 14-18, 2024)
♻ ☆ Visual Whole-Body Control for Legged Loco-Manipulation
We study the problem of mobile manipulation using legged robots equipped with an arm, namely legged loco-manipulation. The robot legs, while usually utilized for mobility, offer an opportunity to amplify the manipulation capabilities by conducting whole-body control. That is, the robot can control the legs and the arm at the same time to extend its workspace. We propose a framework that can conduct the whole-body control autonomously with visual observations. Our approach, namely Visual Whole-Body Control(VBC), is composed of a low-level policy using all degrees of freedom to track the end-effector manipulator position and a high-level policy proposing the end-effector position based on visual inputs. We train both levels of policies in simulation and perform Sim2Real transfer for real robot deployment. We perform extensive experiments and show significant improvements over baselines in picking up diverse objects in different configurations (heights, locations, orientations) and environments. Project page: https://wholebody-b1.github.io
comment: The first two authors contribute equally. Project page: https://wholebody-b1.github.io
Systems and Control
☆ Multi-Agent Clarity-Aware Dynamic Coverage with Gaussian Processes
This paper presents two algorithms for multi-agent dynamic coverage in spatiotemporal environments, where the coverage algorithms are informed by the method of data assimilation. In particular, we show that by considering the information assimilation algorithm, here a Numerical Gaussian Process Kalman Filter, the influence of measurements taken at one position on the uncertainty of the estimate at another location can be computed. We use this relationship to propose new coverage algorithms. Furthermore, we show that the controllers naturally extend to the multi-agent context, allowing for a distributed-control central-information paradigm for multi-agent coverage. Finally, we demonstrate the algorithms through a realistic simulation of a team of UAVs collecting wind data over a region in Austria.
comment: 8 pages, 2 figures, submitted to CDC 2024
☆ Multi-Agent Resilient Consensus under Intermittent Faulty and Malicious Transmissions (Extended Version)
In this work, we consider the consensus problem in which legitimate agents share their values over an undirected communication network in the presence of malicious or faulty agents. Different from the previous works, we characterize the conditions that generalize to several scenarios such as intermittent faulty or malicious transmissions, based on trust observations. As the standard trust aggregation approach based on a constant threshold fails to distinguish intermittent malicious/faulty activity, we propose a new detection algorithm utilizing time-varying thresholds and the random trust values available to legitimate agents. Under these conditions, legitimate agents almost surely determine their trusted neighborhood correctly with geometrically decaying misclassification probabilities. We further prove that the consensus process converges almost surely even in the presence of malicious agents. We also derive the probabilistic bounds on the deviation from the nominal consensus value that would have been achieved with no malicious agents in the system. Numerical results verify the convergence among agents and exemplify the deviation under different scenarios.
comment: Extended Version of CDC '24 submission
☆ Stealthy Deactivation of Safety Filters
Safety filters ensure that only safe control actions are executed. We propose a simple and stealthy false-data injection attack for deactivating such safety filters; in particular, we focus on deactivating safety filters that are based on control-barrier functions. The attack injects false sensor measurements to bias state estimates to the interior of a safety region, which makes the safety filter accept unsafe control actions. To detect such attacks, we also propose a detector that detects biases manufactured by the proposed attack policy, which complements conventional detectors when safety filters are used. The proposed attack policy and detector are illustrated on a double integrator example.
comment: ECC24
☆ Learning the Optimal Power Flow: Environment Design Matters
To solve the optimal power flow (OPF) problem, reinforcement learning (RL) emerges as a promising new approach. However, the RL-OPF literature is strongly divided regarding the exact formulation of the OPF problem as an RL environment. In this work, we collect and implement diverse environment design decisions from the literature regarding training data, observation space, episode definition, and reward function choice. In an experimental analysis, we show the significant impact of these environment design options on RL-OPF training performance. Further, we derive some first recommendations regarding the choice of these design decisions. The created environment framework is fully open-source and can serve as a benchmark for future research in the RL-OPF field.
☆ Steering Feedback in Dynamic Driving Simulators: The Influence of Steering Wheel Vibration and Vehicle Motion Frequency
The validity of the subjective evaluation of steering feedback in driving simulators is crucial for modern vehicle development. Although there are established objective steering characteristics for the assessment of both stationary and dynamic feedback behaviour, factors such as steering wheel vibrations and vehicle body motion, particularly in high-frequency ranges, present challenges in simulator fidelity. This work investigates the influence of steering wheel vibration and vehicle body motion frequency content on the subjective evaluation of steering feedback during closed-loop driving in a dynamic driving simulator. A controlled subject study with 30 participants consisting of a back-to-back comparison of a reference vehicle with an electrical power steering system and three variants of its virtual representation on a dynamic driving simulator was performed. Subjective evaluation focused on the representation of road feedback in comparison to the reference vehicle. The statistical analysis of subjective results show that there is a significant influence of the frequency content of both steering wheel torque and vehicle motion on the subjective evaluation of steering feedback in a dynamic driving simulator. The results suggest an influence of frequency content on the subjective evaluation quality of steering feedback characteristics that are not associated with the dynamic feedback behaviour in the context of established performance indicators.
comment: 12 pages, 7 figures, 9 tables, submitted to the IEEE Transactions on Intelligent Vehicles
☆ Neural Exponential Stabilization of Control-affine Nonlinear Systems
This paper proposes a novel learning-based approach for achieving exponential stabilization of nonlinear control-affine systems. We leverage the Control Contraction Metrics (CCMs) framework to co-synthesize Neural Contraction Metrics (NCMs) and Neural Network (NN) controllers. First, we transform the infinite-dimensional semi-definite program (SDP) for CCM computation into a tractable inequality feasibility problem using element-wise bounds of matrix-valued functions. The terms in the inequality can be efficiently computed by our novel algorithms. Second, we propose a free parametrization of NCMs guaranteeing positive definiteness and the satisfaction of a partial differential equation, regardless of trainable parameters. Third, this parametrization and the inequality condition enable the design of contractivity-enforcing regularizers, which can be incorporated while designing the NN controller for exponential stabilization of the underlying nonlinear systems. Furthermore, when the training loss goes to zero, we provide formal guarantees on verification of the NCM and the exponentional stabilization under the NN controller. Finally, we validate our method through benchmark experiments on set-point stabilization and increasing the region of attraction of a locally pre-stabilized closed-loop system.
comment: This paper is submitted in CDC2024 for a possible publication
☆ A PAC-Bayesian Framework for Optimal Control with Stability Guarantees
Stochastic Nonlinear Optimal Control (SNOC) involves minimizing a cost function that averages out the random uncertainties affecting the dynamics of nonlinear systems. For tractability reasons, this problem is typically addressed by minimizing an empirical cost, which represents the average cost across a finite dataset of sampled disturbances. However, this approach raises the challenge of quantifying the control performance against out-of-sample uncertainties. Particularly, in scenarios where the training dataset is small, SNOC policies are prone to overfitting, resulting in significant discrepancies between the empirical cost and the true cost, i.e., the average SNOC cost incurred during control deployment. Therefore, establishing generalization bounds on the true cost is crucial for ensuring reliability in real-world applications. In this paper, we introduce a novel approach that leverages PAC-Bayes theory to provide rigorous generalization bounds for SNOC. Based on these bounds, we propose a new method for designing optimal controllers, offering a principled way to incorporate prior knowledge into the synthesis process, which aids in improving the control policy and mitigating overfitting. Furthermore, by leveraging recent parametrizations of stabilizing controllers for nonlinear systems, our framework inherently ensures closed-loop stability. The effectiveness of our proposed method in incorporating prior knowledge and combating overfitting is shown by designing neural network controllers for tasks in cooperative robotics.
☆ Neural Distributed Controllers with Port-Hamiltonian Structures
Controlling large-scale cyber-physical systems necessitates optimal distributed policies, relying solely on local real-time data and limited communication with neighboring agents. However, finding optimal controllers remains challenging, even in seemingly simple scenarios. Parameterizing these policies using Neural Networks (NNs) can deliver good performance, but their sensitivity to small input changes can destabilize the closed-loop system. This paper addresses this issue for a network of nonlinear dissipative systems. Specifically, we leverage well-established port-Hamiltonian structures to characterize deep distributed control policies with closed-loop stability guarantees and a finite $\mathcal{L}_2$ gain, regardless of specific NN parameters. This eliminates the need to constrain the parameters during optimization and enables training with standard methods like stochastic gradient descent. A numerical study on the consensus control of Kuramoto oscillators demonstrates the effectiveness of the proposed controllers.
comment: This paper is submitted in CDC2024 for a possible publication
☆ Optical Flow Based Detection and Tracking of Moving Objects for Autonomous Vehicles
Accurate velocity estimation of surrounding moving objects and their trajectories are critical elements of perception systems in Automated/Autonomous Vehicles (AVs) with a direct impact on their safety. These are non-trivial problems due to the diverse types and sizes of such objects and their dynamic and random behaviour. Recent point cloud based solutions often use Iterative Closest Point (ICP) techniques, which are known to have certain limitations. For example, their computational costs are high due to their iterative nature, and their estimation error often deteriorates as the relative velocities of the target objects increase (>2 m/sec). Motivated by such shortcomings, this paper first proposes a novel Detection and Tracking of Moving Objects (DATMO) for AVs based on an optical flow technique, which is proven to be computationally efficient and highly accurate for such problems. \textcolor{black}{This is achieved by representing the driving scenario as a vector field and applying vector calculus theories to ensure spatiotemporal continuity.} We also report the results of a comprehensive performance evaluation of the proposed DATMO technique, carried out in this study using synthetic and real-world data. The results of this study demonstrate the superiority of the proposed technique, compared to the DATMO techniques in the literature, in terms of estimation accuracy and processing time in a wide range of relative velocities of moving objects. Finally, we evaluate and discuss the sensitivity of the estimation error of the proposed DATMO technique to various system and environmental parameters, as well as the relative velocities of the moving objects.
comment: This manuscript has been accepted as a regular paper in Transactions on Intelligent Transportation Systems (DOI: 10.1109/TITS.2024.3382495)
☆ On Structural Non-commutativity in Affine Feedback of SISO Nonlinear Systems
The affine feedback connection of SISO nonlinear systems modeled by Chen--Fliess series is shown to be a group action on the plant which is isomorphic to the semi-direct product of shuffle and additive group of non-commutative formal power series. The additive and multiplicative feedback loops in an affine feedback connection are thus proven to be structurally non-commutative. A flip in the order of these loops results in a net additive feedback loop.
comment: submitted to $26^{th}$ International Symposium on Mathematical Theory of Networks and Systems, 2024
☆ Using quantum computers in control: interval matrix properties
Quantum computing provides a powerful framework for tackling computational problems that are classically intractable. The goal of this paper is to explore the use of quantum computers for solving relevant problems in systems and control theory. In the recent literature, different quantum algorithms have been developed to tackle binary optimization, which plays an important role in various control-theoretic problems. As a prototypical example, we consider the verification of interval matrix properties such as non-singularity and stability on a quantum computer. We present a quantum algorithm solving these problems and we study its performance in simulation. Our results demonstrate that quantum computers provide a promising tool for control whose applicability to further computationally complex problems remains to be explored.
comment: Final version, accepted for publication in Proc. European Control Conference (ECC), 2024
☆ Prioritize Team Actions: Multi-Agent Temporal Logic Task Planning with Ordering Constraints
In this paper, we investigate the problem of linear temporal logic (LTL) path planning for multi-agent systems, introducing the new concept of \emph{ordering constraints}. Specifically, we consider a generic objective function that is defined for the path of each individual agent. The primary objective is to find a global plan for the team of agents, ensuring they collectively meet the specified LTL requirements. Simultaneously, we aim to maintain a pre-determined order in the values of the objective function for each agent, which we refer to as the ordering constraints. This new requirement stems from scenarios like security-aware planning, where relative orders outweigh absolute values in importance. We present an efficient algorithm to solve this problem, supported by proofs of correctness that demonstrate the optimality of our solution. Additionally, we provide a case study in security-aware path planning to illustrate the practicality and effectiveness of our proposed approach.
☆ Chattering Phenomena in Time-Optimal Control for High-Order Chain-of-Integrators Systems with Full State Constraints
Time-optimal control for high-order chain-of-integrators systems with full state constraints and arbitrary given terminal states remains an open and challenging problem in optimal control theory domain. However, optimal control's behaviors in high-order problems lack of precision characterization, even where the existence of chattering phenomena remain unknown and overlooked. This paper establishes a theoretical framework of chattering phenomena in the problem, focusing on the uniqueness of state constraints inducing chattering, the upper bound of switching times in an unconstrained arc during chattering, and the convergence of states and costates to the chattering limit point. For the first time, this paper proves the existence of chattering phenomena in the problems. The chattering optimal control for 4th order problems with velocity constraints is precisely solved, providing an approach to plan strictly time-optimal snap-limited trajectories, while other cases of order $n\leq4$ are proved to not allow chattering. The conclusions correct the longstanding misconception in the industry regarding the time-optimality of S-shaped trajectories with minimal switching times.
☆ Ultrafast Adaptive Primary Frequency Tuning and Secondary Frequency Identification for S/S WPT system
Magnetic resonance wireless power transfer (WPT) technology is increasingly being adopted across diverse applications. However, its effectiveness can be significantly compromised by parameter shifts within the resonance network, owing to its high system quality factor. Such shifts are inherent and challenging to mitigate during the manufacturing process. In response, this article introduces a rapid frequency tuning approach. Leveraging switch-controlled capacitors (SCC) to adjust the resonance network and the primary side's operating frequency, alongside a current zero-crossing detection (ZCD) circuit for voltage-current phase determination, this method circumvents the need for intricate knowledge of WPT system parameters. Moreover, it obviates the necessity for inter-side communication for real-time identification of the secondary side resonance frequency. The swift response of SCC and two-step perturb-and-observe algorithm mitigate output disturbances, thereby expediting the frequency tuning process. Experimental validation on a 200W Series-Series compensated WPT (SS-WPT) system demonstrates that the proposed method achieves frequency recognition accuracy within 0.7kHz in less than 1ms, increasing system efficiency up to 9%.
comment: 11 pages,16 figures,to be published in IEEE Transactions on Industrial Electronics
☆ Aerial Robots Carrying Flexible Cables: Dynamic Shape Optimal Control via Spectral Method Model
In this work, we present a model-based optimal boundary control design for an aerial robotic system composed of a quadrotor carrying a flexible cable. The whole system is modeled by partial differential equations (PDEs) combined with boundary conditions described by ordinary differential equations (ODEs). The proper orthogonal decomposition (POD) method is adopted to project the original infinite-dimensional system on a subspace spanned by orthogonal basis functions. Based on the reduced order model, nonlinear model predictive control (NMPC) is implemented online to realize shape trajectory tracking of the flexible cable in an optimal predictive fashion. The proposed reduced modeling and optimal control paradigms are numerically verified against an accurate high-dimensional FDM-based model in different scenarios and the controller's superior performance is shown compared to an optimally tuned PID controller.
☆ Robust Stability for Multiagent Systems with Spatio-Temporally Correlated Packet Loss
A problem with considering correlations in the analysis of multiagent system with stochastic packet loss is that they induce dependencies between agents that are otherwise decoupled, preventing the application of decomposition methods required for efficient evaluation. To circumvent that issue, this paper is proposing an approach based on analysing sets of networks with independent communication links, only considering the correlations in an implicit fashion. Combining ideas from the robust stabilization of Markov jump linear systems with recently proposed techniques for analysing packet loss in multiagent systems, we obtain a linear matrix inequality based stability condition which is independent of the number of agents. The main result is that the set of stabilized probability distributions has non-empty interior such that small correlations cannot lead to instability, even though only distributions of independent links were analysed. Moreover, two examples are provided to demonstrate the applicability of the results to practically relevant scenarios.
comment: 7 pages, 2 figures, 1 table
☆ Cyclic pursuit formation control for arbitrary desired shapes
A multi-agent system comprises numerous agents that autonomously make decisions to collectively accomplish tasks, drawing significant attention for their wide-ranging applications. Within this context, formation control emerges as a prominent task, wherein agents collaboratively shape and maneuver while preserving formation integrity. Our focus centers on cyclic pursuit, a method facilitating the formation of circles, ellipses, and figure-eights under the assumption that agents can only perceive the relative positions of those preceding them. However, this method's scope has been restricted to these specific shapes, leaving the feasibility of forming other shapes uncertain. In response, our study proposes a novel method based on cyclic pursuit capable of forming a broader array of shapes, enabling agents to individually shape while pursuing preceding agents, thereby extending the repertoire of achievable formations. We present two scenarios concerning the information available to agents and devise formation control methods tailored to each scenario. Through extensive simulations, we demonstrate the efficacy of our proposed method in forming multiple shapes, including those represented as Fourier series, thereby underscoring the versatility and effectiveness of our approach.
☆ Natural-artificial hybrid swarm: Cyborg-insect group navigation in unknown obstructed soft terrain
Navigating multi-robot systems in complex terrains has always been a challenging task. This is due to the inherent limitations of traditional robots in collision avoidance, adaptation to unknown environments, and sustained energy efficiency. In order to overcome these limitations, this research proposes a solution by integrating living insects with miniature electronic controllers to enable robotic-like programmable control, and proposing a novel control algorithm for swarming. Although these creatures, called cyborg insects, have the ability to instinctively avoid collisions with neighbors and obstacles while adapting to complex terrains, there is a lack of literature on the control of multi-cyborg systems. This research gap is due to the difficulty in coordinating the movements of a cyborg system under the presence of insects' inherent individual variability in their reactions to control input. In response to this issue, we propose a novel swarm navigation algorithm addressing these challenges. The effectiveness of the algorithm is demonstrated through an experimental validation in which a cyborg swarm was successfully navigated through an unknown sandy field with obstacles and hills. This research contributes to the domain of swarm robotics and showcases the potential of integrating biological organisms with robotics and control theory to create more intelligent autonomous systems with real-world applications.
☆ Reverse Kron reduction of Multi-phase Radial Network
We consider the problem of identifying the admittance matrix of a three-phase radial network from voltage and current measurements at a subset of nodes. These measurements are used to estimate a virtual network represented by the Kron reduction (Schur complement) of the full admittance matrix. We focus on recovering exactly the full admittance matrix from its Kron reduction, i.e., computing the inverse of Schur complement. The key idea is to decompose Kron reduction into a sequence of iterations that maintains an invariance structure, and exploit this structure to reverse each step of the iterative Kron reduction.
☆ A Moreau Envelope Approach for LQR Meta-Policy Estimation
We study the problem of policy estimation for the Linear Quadratic Regulator (LQR) in discrete-time linear time-invariant uncertain dynamical systems. We propose a Moreau Envelope-based surrogate LQR cost, built from a finite set of realizations of the uncertain system, to define a meta-policy efficiently adjustable to new realizations. Moreover, we design an algorithm to find an approximate first-order stationary point of the meta-LQR cost function. Numerical results show that the proposed approach outperforms naive averaging of controllers on new realizations of the linear system. We also provide empirical evidence that our method has better sample complexity than Model-Agnostic Meta-Learning (MAML) approaches.
comment: 8 pages
☆ Reinforcement Learning-based Receding Horizon Control using Adaptive Control Barrier Functions for Safety-Critical Systems
Optimal control methods provide solutions to safety-critical problems but easily become intractable. Control Barrier Functions (CBFs) have emerged as a popular technique that facilitates their solution by provably guaranteeing safety, through their forward invariance property, at the expense of some performance loss. This approach involves defining a performance objective alongside CBF-based safety constraints that must always be enforced. Unfortunately, both performance and solution feasibility can be significantly impacted by two key factors: (i) the selection of the cost function and associated parameters, and (ii) the calibration of parameters within the CBF-based constraints, which capture the trade-off between performance and conservativeness. %as well as infeasibility. To address these challenges, we propose a Reinforcement Learning (RL)-based Receding Horizon Control (RHC) approach leveraging Model Predictive Control (MPC) with CBFs (MPC-CBF). In particular, we parameterize our controller and use bilevel optimization, where RL is used to learn the optimal parameters while MPC computes the optimal control input. We validate our method by applying it to the challenging automated merging control problem for Connected and Automated Vehicles (CAVs) at conflicting roadways. Results demonstrate improved performance and a significant reduction in the number of infeasible cases compared to traditional heuristic approaches used for tuning CBF-based controllers, showcasing the effectiveness of the proposed method.
☆ Destination-Constrained Linear Dynamical System Modeling in Set-Valued Frameworks
Directional motion towards a specified destination is a common occurrence in physical processes and human societal activities. Utilizing this prior information can significantly improve the control and predictive performance of system models. This paper primarily focuses on reconstructing linear dynamic system models based on destination constraints in the set-valued framework. We treat destination constraints as inherent information in the state evolution process and employ convex optimization techniques to construct a coherent and robust state model. This refined model effectively captures the impact of destination constraints on the state evolution at each time step. Furthermore, we design an optimal weight matrix for the reconstructed model to ensure smoother and more natural trajectories of state evolution. We also analyze the theoretical guarantee of optimality for this weight matrix and the properties of the reconstructed model. Finally, simulation experiments verify that the reconstructed model has significant advantages over the unconstrained and unoptimized weighted models and constrains the evolution of state trajectories with different starting and ending points.
comment: 15 pages, 11 figures
♻ ☆ Optimal Control of Grid-Interfacing Inverters With Current Magnitude Limits
Grid-interfacing inverters act as the interface between renewable resources and the electric grid, and have the potential to offer fast and programmable controls compared to synchronous generators. With this flexibility there has been significant research efforts into determining the best way to control these inverters. Inverters are limited in their maximum current output in order to protect semiconductor devices, presenting a nonlinear constraint that needs to be accounted for in their control algorithms. Existing approaches either simply saturate a controller that is designed for unconstrained systems, or assume small perturbations and linearize a saturated system. These approaches can lead to stability issues or limiting the control actions to be too conservative. In this paper, we directly focus on a nonlinear system that explicitly accounts for the saturation of the current magnitude. We use a Lyapunov stability approach to determine a stability condition for the system, guaranteeing that a class of controllers would be stabilizing if they satisfy a simple SDP condition. With this condition we fit a linear-feedback controller by sampling the output (offline) model predictive control problems. This learned controller has improved performances with existing designs.
comment: 6 pages, 6 figures, 1 table. Submitted to CDC'2024
♻ ☆ Faster Run-to-Run Feedforward Control of Electromechanical Switching Devices: a Sensitivity-Based Approach
Electromechanical switching devices, such as solenoid valves, contactors, and relays, suffer from undesirable phenomena like clicking, mechanical wear, and contact bounce. Despite that, they are still widely used in industry due to their various economic and technical advantages. This has encouraged the development of controllers aimed at reducing the collisions that occur at the end of the switching operations. One of the most successful approaches has been the use of iterative techniques. However, these algorithms typically require a large number of operations to converge, which is definitely a clear drawback. This paper presents a strategy to improve the convergence rate of such controllers. Our proposal, which is based on the sensitivity of the control law with respect to the parameters, assumes that the performance of the system is more heavily affected by some parameters than others. Thus, by avoiding movements in the directions that have less impact, the search algorithm is expected to drive the system to near-optimal behaviors using fewer operations. Results obtained by simulation show significant improvement in the convergence rate of a state-of-the-art run-to-run feedforward controller, which demonstrates the high potential of the proposal.
comment: 6 pages, 4 figures. Final version, after peer review and acceptance, submitted to the 22nd European Control Conference (ECC)
♻ ☆ Resilient source seeking with robot swarms
We present a solution for locating the source, or maximum, of an unknown scalar field using a swarm of mobile robots. Unlike relying on the traditional gradient information, the swarm determines an ascending direction to approach the source with arbitrary precision. The ascending direction is calculated from measurements of the field strength at the robot locations and their relative positions concerning the centroid. Rather than focusing on individual robots, we focus the analysis on the density of robots per unit area to guarantee a more resilient swarm, i.e., the functionality remains even if individuals go missing or are misplaced during the mission. We reinforce the robustness of the algorithm by providing sufficient conditions for the swarm shape so that the ascending direction is almost parallel to the gradient. The swarm can respond to an unexpected environment by morphing its shape and exploiting the existence of multiple ascending directions. Finally, we validate our approach numerically with hundreds of robots. The fact that a large number of robots always calculate an ascending direction compensates for the loss of individuals and mitigates issues arising from the actuator and sensor noises.
comment: 7 pages, submitted to CDC 2024
♻ ☆ Frequency Regulation with Storage: On Losses and Profits
Low-carbon societies will need to store vast amounts of electricity to balance intermittent generation from wind and solar energy, for example, through frequency regulation. Here, we derive an analytical solution to the decision-making problem of storage operators who sell frequency regulation power to grid operators and trade electricity on day-ahead markets. Mathematically, we treat future frequency deviation trajectories as functional uncertainties in a receding horizon robust optimization problem. We constrain the expected terminal state-of-charge to be equal to some target to allow storage operators to make good decisions not only for the present but also the future. Thanks to this constraint, the amount of electricity traded on day-ahead markets is an implicit function of the regulation power sold to grid operators. The implicit function quantifies the amount of power that needs to be purchased to cover the expected energy loss that results from providing frequency regulation. We show how the marginal cost associated with the expected energy loss decreases with roundtrip efficiency and increases with frequency deviation dispersion. We find that the profits from frequency regulation over the lifetime of energy-constrained storage devices are roughly inversely proportional to the length of time for which regulation power must be committed.
♻ ☆ A Safe Preference Learning Approach for Personalization with Applications to Autonomous Vehicles
This work introduces a preference learning method that ensures adherence to given specifications, with an application to autonomous vehicles. Our approach incorporates the priority ordering of Signal Temporal Logic (STL) formulas describing traffic rules into a learning framework. By leveraging Parametric Weighted Signal Temporal Logic (PWSTL), we formulate the problem of safety-guaranteed preference learning based on pairwise comparisons and propose an approach to solve this learning problem. Our approach finds a feasible valuation for the weights of the given PWSTL formula such that, with these weights, preferred signals have weighted quantitative satisfaction measures greater than their non-preferred counterparts. The feasible valuation of weights given by our approach leads to a weighted STL formula that can be used in correct-and-custom-by-construction controller synthesis. We demonstrate the performance of our method with a pilot human subject study in two different simulated driving scenarios involving a stop sign and a pedestrian crossing. Our approach yields competitive results compared to existing preference learning methods in terms of capturing preferences and notably outperforms them when safety is considered.
comment: 9 pages, 3 figures, 2 tables. This work has been published at IEEE Robotics and Automation Letters. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Ensuring Disturbance Rejection Performance by Synthesizing Grid-Following and Grid-Forming Inverters in Power Systems
To satisfy dynamic requirements of power systems, it is imperative for grid-tied inverters to ensure good disturbance rejection performance (DRP) under variable grid conditions. This letter discovers and theoretically proves that for general networks, synthesizing grid-following (GFL) inverters and grid-forming (GFM) inverters can always more effectively ensure the DRP of multiple inverters, as compared to homogeneous inverter-based systems that solely utilize either GFL or GFM inverters. The synthesis of GFL inverters and GFM inverters can concurrently increase the smallest eigenvalue and decrease the largest eigenvalue of the network grounded Laplacian matrix. This can be equivalent to rematching the proper short-circuit ratio (SCR) for GFL and GFM inverters, thereby ensuring the DRP of inverters both in weak and strong grids. The results reveal the necessity of synthesizing diverse inverter control schemes from the network-based perspective. Sensitivity function-based tests and real-time simulations validate our results.
comment: 5 pages
♻ ☆ Closing the Gap to Quadratic Invariance: a Regret Minimization Approach to Optimal Distributed Control
In this work, we focus on the design of optimal controllers that must comply with an information structure. State-of-the-art approaches do so based on the H2 or Hinfty norm to minimize the expected or worst-case cost in the presence of stochastic or adversarial disturbances. Large-scale systems often experience a combination of stochastic and deterministic disruptions (e.g., sensor failures, environmental fluctuations) that spread across the system and are difficult to model precisely, leading to sub-optimal closed-loop behaviors. Hence, we propose improving performance for these scenarios by minimizing the regret with respect to an ideal policy that complies with less stringent sensor-information constraints. This endows our controller with the ability to approach the improved behavior of a more informed policy, which would detect and counteract heterogeneous and localized disturbances more promptly. Specifically, we derive convex relaxations of the resulting regret minimization problem that are compatible with any desired controller sparsity, while we reveal a renewed role of the Quadratic Invariance (QI) condition in designing informative benchmarks to measure regret. Last, we validate our proposed method through numerical simulations on controlling a multi-agent distributed system, comparing its performance with traditional H2 and Hinfty policies.
comment: Accepted for presentation and publication in the proceedings of the 2024 European Control Conference (ECC)
♻ ☆ Time-Optimal Control for High-Order Chain-of-Integrators Systems with Full State Constraints and Arbitrary Terminal States (Extended Version)
Time-optimal control for high-order chain-of-integrators systems with full state constraints and arbitrary given terminal states remains a challenging problem in the optimal control theory domain, yet to be resolved. To enhance further comprehension of the problem, this paper establishes a novel notation system and theoretical framework, providing the switching manifold for high-order problems in the form of switching laws. Through deriving properties of switching laws on signs and dimension, this paper proposes a definite condition for time-optimal control. Guided by the developed theory, a trajectory planning method named the manifold-intercept method (MIM) is developed. The proposed MIM can plan time-optimal jerk-limited trajectories with full state constraints, and can also plan near-optimal non-chattering higher-order trajectories with negligible extra motion time compared to optimal profiles. Numerical results indicate that the proposed MIM outperforms all baselines in computational time, computational accuracy, and trajectory quality by a large gap.
♻ ☆ Discretized Distributed Optimization over Dynamic Digraphs
We consider a discrete-time model of continuous-time distributed optimization over dynamic directed-graphs (digraphs) with applications to distributed learning. Our optimization algorithm works over general strongly connected dynamic networks under switching topologies, e.g., in mobile multi-agent systems and volatile networks due to link failures. Compared to many existing lines of work, there is no need for bi-stochastic weight designs on the links. The existing literature mostly needs the link weights to be stochastic using specific weight-design algorithms needed both at the initialization and at all times when the topology of the network changes. This paper eliminates the need for such algorithms and paves the way for distributed optimization over time-varying digraphs. We derive the bound on the gradient-tracking step-size and discrete time-step for convergence and prove dynamic stability using arguments from consensus algorithms, matrix perturbation theory, and Lyapunov theory. This work, particularly, is an improvement over existing stochastic-weight undirected networks in case of link removal or packet drops. This is because the existing literature may need to rerun time-consuming and computationally complex algorithms for stochastic design, while the proposed strategy works as long as the underlying network is weight-symmetric and balanced. The proposed optimization framework finds applications to distributed classification and learning.
♻ ☆ When Robotics Meets Wireless Communications: An Introductory Tutorial
The importance of ground Mobile Robots (MRs) and Unmanned Aerial Vehicles (UAVs) within the research community, industry, and society is growing fast. Many of these agents are nowadays equipped with communication systems that are, in some cases, essential to successfully achieve certain tasks. In this context, we have begun to witness the development of a new interdisciplinary research field at the intersection of robotics and communications. This research field has been boosted by the intention of integrating UAVs within the 5G and 6G communication networks. This research will undoubtedly lead to many important applications in the near future. Nevertheless, one of the main obstacles to the development of this research area is that most researchers address these problems by oversimplifying either the robotics or the communications aspect. This impedes the ability of reaching the full potential of this new interdisciplinary research area. In this tutorial, we present some of the modelling tools necessary to address problems involving both robotics and communication from an interdisciplinary perspective. As an illustrative example of such problems, we focus in this tutorial on the issue of communication-aware trajectory planning.
comment: 35 pages, 192 references
♻ ☆ Stable Linear Subspace Identification: A Machine Learning Approach
Machine Learning (ML) and linear System Identification (SI) have been historically developed independently. In this paper, we leverage well-established ML tools - especially the automatic differentiation framework - to introduce SIMBa, a family of discrete linear multi-step-ahead state-space SI methods using backpropagation. SIMBa relies on a novel Linear-Matrix-Inequality-based free parametrization of Schur matrices to ensure the stability of the identified model. We show how SIMBa generally outperforms traditional linear state-space SI methods, and sometimes significantly, although at the price of a higher computational burden. This performance gap is particularly remarkable compared to other SI methods with stability guarantees, where the gain is frequently above 25% in our investigations, hinting at SIMBa's ability to simultaneously achieve state-of-the-art fitting performance and enforce stability. Interestingly, these observations hold for a wide variety of input-output systems and on both simulated and real-world data, showcasing the flexibility of the proposed approach. We postulate that this new SI paradigm presents a great extension potential to identify structured nonlinear models from data, and we hence open-source SIMBa on https://github.com/Cemempamoi/simba.
comment: Accepted at ECC 2024
♻ ☆ Modelling Irrational Behaviour of Residential End Users using Non-Stationary Gaussian Processes
Demand response (DR) plays a critical role in ensuring efficient electricity consumption and optimal use of network assets. Yet, existing DR models often overlook a crucial element, the irrational behaviour of electricity end users. In this work, we propose a price-responsive model that incorporates key aspects of end-user irrationality, specifically loss aversion, time inconsistency, and bounded rationality. To this end, we first develop a framework that uses Multiple Seasonal-Trend decomposition using Loess (MSTL) and non-stationary Gaussian processes to model the randomness in the electricity consumption by residential consumers. The impact of this model is then evaluated through a community battery storage (CBS) business model. Additionally, we apply a chance-constrained optimisation model for CBS operation that deals with the unpredictability of the end-user irrationality. Our simulations using real-world data show that the proposed DR model provides a more realistic estimate of end-user price-responsive behaviour when considering irrationality. Compared to a deterministic model that cannot fully take into account the irrational behaviour of end users, the chance-constrained CBS operation model yields an additional 19% revenue. Lastly, the business model reduces the electricity costs of solar end users by 11%.
comment: This manuscript has been accepted for publication in IEEE Transactions on Smart Grid
♻ ☆ Full Attitude Intelligent Controller Design of a Heliquad under Complete Failure of an Actuator
In this paper, we design a reliable Heliquad and develop an intelligent controller to handle one actuators complete failure. Heliquad is a multi-copter similar to Quadcopter, with four actuators diagonally symmetric from the center. Each actuator has two control inputs; the first input changes the propeller blades collective pitch (also called variable pitch), and the other input changes the rotation speed. For reliable operation and high torque characteristic requirement for yaw control, a cambered airfoil is used to design propeller blades. A neural network-based control allocation is designed to provide complete control authority even under a complete loss of one actuator. Nonlinear quaternion based outer loop position control, with proportional-derivative inner loop for attitude control and neural network-based control allocation is used in controller design. The proposed controller and Heliquad designs performance is evaluated using a software-in-loop simulation to track the position reference command under failure. The results clearly indicate that the Heliquad with an intelligent controller provides necessary tracking performance even under a complete loss of one actuator.
comment: 7 pages, For video go to https://indianinstituteofscience-my.sharepoint.com/:v:/g/personal/eeshank_iisc_ac_in/EcMg2uTtE91AsHDejNkb6YMBNckaXGjeh_YMzDV6sAHZAQ?e=DrRqmN
♻ ☆ Autonomous Hook-Based Grasping and Transportation with Quadcopters
Payload grasping and transportation with quadcopters is an active research area that has rapidly developed over the last decade. To grasp a payload without human interaction, most state-of-the-art approaches apply robotic arms that are attached to the quadcopter body. However, due to the large weight and power consumption of these aerial manipulators, their agility and flight time are limited. This paper proposes a motion control and planning method for transportation with a lightweight, passive manipulator structure that consists of a hook attached to a quadrotor using a 1 DoF revolute joint. To perform payload grasping, transportation, and release, first, time-optimal reference trajectories are designed through specific waypoints to ensure the fast and reliable execution of the tasks. Then, a two-stage motion control approach is developed based on a robust geometric controller for precise and reliable reference tracking and a linear--quadratic payload regulator for rapid setpoint stabilization of the payload swing. Furthermore, stability of the closed-loop system is mathematically proven to give safety guarantee for its operation. The proposed control architecture and design are evaluated in a high-fidelity physical simulator, and also in real flight experiments, using a custom-made quadrotor--hook manipulator platform.
♻ ☆ Robustness Evaluation of Localization Techniques for Autonomous Racing
This work introduces SynPF, an MCL-based algorithm tailored for high-speed racing environments. Benchmarked against Cartographer, a state-of-the-art pose-graph SLAM algorithm, SynPF leverages synergies from previous particle-filtering methods and synthesizes them for the high-performance racing domain. Our extensive in-field evaluations reveal that while Cartographer excels under nominal conditions, it struggles when subjected to wheel-slip, a common phenomenon in a racing scenario due to varying grip levels and aggressive driving behaviour. Conversely, SynPF demonstrates robustness in these challenging conditions and a low-latency computation time of 1.25 ms on on-board computers without a GPU. Using the F1TENTH platform, a 1:10 scaled autonomous racing vehicle, this work not only highlights the vulnerabilities of existing algorithms in high-speed scenarios, tested up until 7.6 m/s, but also emphasizes the potential of SynPF as a viable alternative, especially in deteriorating odometry conditions.
comment: Accepted at the Design, Automation and Test in Europe Conference 2024 as an extended abstract
♻ ☆ Credit vs. Discount-Based Congestion Pricing: A Comparison Study
Tolling, or congestion pricing, offers a promising traffic management policy for regulating congestion, but has also attracted criticism for placing outsized financial burdens on low-income users. Credit-based congestion pricing (CBCP) and discount-based congestion pricing (DBCP) policies, which respectively provide travel credits and toll discounts to low-income users on tolled roads, have emerged as promising mechanisms for reducing traffic congestion without worsening societal inequities. However, the optimal design of CBCP and DBCP policies, as well as their relative advantages and disadvantages, remain poorly understood. To address this, we study the effects of implementing CBCP and DBCP policies to route users on a network of multi-lane highways with tolled express lanes. We formulate a non-atomic routing game framework in which a subset of eligible users is granted toll relief in the form of a fixed budget or toll discount, while the remaining ineligible users must pay out-of-pocket. We prove the existence of Nash equilibrium traffic flow patterns corresponding to any given CBCP or DBCP policy. Under the additional assumption that eligible users have time-invariant VoTs, we provide a convex program to efficiently compute these equilibria. For networks consisting of a single edge, we identify conditions under which CBCP policies outperform DBCP policies (and vice versa), in the sense of improving eligible users' access to the express lane. Finally, we present empirical results from a CBCP pilot study of the San Mateo 101 Express Lane Project in California. Our empirical results corroborate our theoretical analysis of the impact of deploying credit-based and discount-based policies, and lend insights into the sensitivity of their impact with respect to the travel demand and users' VoTs.
♻ ☆ Achievable Sum Rate Optimization on NOMA-aided Cell-Free Massive MIMO with Finite Blocklength Coding
Non-orthogonal multiple access (NOMA)-aided cell-free massive multiple-input multiple-output (CFmMIMO) has been considered as a promising technology to fulfill strict quality of service requirements for ultra-reliable low-latency communications (URLLC). However, finite blocklength coding (FBC) in URLLC makes it challenging to achieve the optimal performance in the NOMA-aided CFmMIMO system. In this paper, we investigate the performance of the NOMA-aided CFmMIMO system with FBC in terms of achievable sum rate (ASR). Firstly, we derive a lower bound (LB) on the ergodic data rate. Then, we formulate an ASR maximization problem by jointly considering power allocation and user equipment (UE) clustering. To tackle such an intractable problem, we decompose it into two sub-problems, i.e., the power allocation problem and the UE clustering problem. A successive convex approximation (SCA) algorithm is proposed to solve the power allocation problem by transforming it into a series of geometric programming problems. Meanwhile, two algorithms based on graph theory are proposed to solve the UE clustering problem by identifying negative loops. Finally, alternative optimization is performed to find the maximum ASR of the NOMA-aided CFmMIMO system with FBC. The simulation results demonstrate that the proposed algorithms significantly outperform the benchmark algorithms in terms of ASR under various scenarios.
Databases
☆ Empirical Analysis of EIP-3675: Miner Dynamics, Transaction Fees, and Transaction Time
The Ethereum Improvement Proposal 3675 (EIP-3675) marks a significant shift, transitioning from a Proof of Work (PoW) to a Proof of Stake (PoS) consensus mechanism. This transition resulted in a staggering 99.95% decrease in energy consumption. However, the transition prompts two critical questions: (1). How does EIP-3675 affect miners' dynamics? and (2). How do users determine priority fees, considering that paying too little may cause delays or non-inclusion, yet paying too much wastes money with little to no benefits? To address the first question, we present a comprehensive empirical study examining EIP-3675's effect on miner dynamics (i.e., miner participation, distribution, and the degree of randomness in miner selection). Our findings reveal that the transition has encouraged broader participation of miners in block append operation, resulting in a larger pool of unique miners ($\approx50\times$ PoW), and the change in miner distribution with the increased number of unique small category miners ($\approx60\times$ PoW). However, there is an unintended consequence: a reduction in the miner selection randomness, which signifies the negative impact of the transition to PoS-Ethereum on network decentralization. Regarding the second question, we employed regression-based machine learning models; the Gradient Boosting Regressor performed best in predicting priority fees, while the K-Neighbours Regressor was worst.
comment: This paper has been accepted as a full paper at the IEEE International Conference on Blockchain and Cryptocurrency (ICBC) 2024
☆ Query Refinement for Diverse Top-$k$ Selection
Database queries are often used to select and rank items as decision support for many applications. As automated decision-making tools become more prevalent, there is a growing recognition of the need to diversify their outcomes. In this paper, we define and study the problem of modifying the selection conditions of an ORDER BY query so that the result of the modified query closely fits some user-defined notion of diversity while simultaneously maintaining the intent of the original query. We show the hardness of this problem and propose a Mixed Integer Linear Programming (MILP) based solution. We further present optimizations designed to enhance the scalability and applicability of the solution in real-life scenarios. We investigate the performance characteristics of our algorithm and show its efficiency and the usefulness of our optimizations.
☆ Towards a FAIR Documentation of Workflows and Models in Applied Mathematics
Modeling-Simulation-Optimization workflows play a fundamental role in applied mathematics. The Mathematical Research Data Initiative, MaRDI, responded to this by developing a FAIR and machine-interpretable template for a comprehensive documentation of such workflows. MaRDMO, a Plugin for the Research Data Management Organiser, enables scientists from diverse fields to document and publish their workflows on the MaRDI Portal seamlessly using the MaRDI template. Central to these workflows are mathematical models. MaRDI addresses them with the MathModDB ontology, offering a structured formal model description. Here, we showcase the interaction between MaRDMO and the MathModDB Knowledge Graph through an algebraic modeling workflow from the Digital Humanities. This demonstration underscores the versatility of both services beyond their original numerical domain.
☆ When View- and Conflict-Robustness Coincide for Multiversion Concurrency Control
A DBMS allows trading consistency for efficiency through the allocation of isolation levels that are strictly weaker than serializability. The robustness problem asks whether, for a given set of transactions and a given allocation of isolation levels, every possible interleaved execution of those transactions that is allowed under the provided allocation, is always safe. In the literature, safe is interpreted as conflict-serializable (to which we refer here as conflict-robustness). In this paper, we study the view-robustness problem, interpreting safe as view-serializable. View-serializability is a more permissive notion that allows for a greater number of schedules to be serializable and aligns more closely with the intuitive understanding of what it means for a database to be consistent. However, view-serializability is more complex to analyze (e.g., conflict-serializability can be decided in polynomial time whereas deciding view-serializability is NP-complete). While conflict-robustness implies view-robustness, the converse does not hold in general. In this paper, we provide a sufficient condition for isolation levels guaranteeing that conflict- and view-robustness coincide and show that this condition is satisfied by the isolation levels occurring in Postgres and Oracle: read committed (RC), snapshot isolation (SI) and serializable snapshot isolation (SSI). It hence follows that for these systems, widening from conflict- to view-serializability does not allow for more sets of transactions to become robust. Interestingly, the complexity of deciding serializability within these isolation levels is still quite different. Indeed, deciding conflict-serializability for schedules allowed under RC and SI remains in polynomial time, while we show that deciding view-serializability within these isolation levels remains NP-complete.
comment: 22 pages
☆ Geometric planted matchings beyond the Gaussian model
We consider the problem of recovering an unknown matching between a set of $n$ randomly placed points in $\mathbb{R}^d$ and random perturbations of these points. This can be seen as a model for particle tracking and more generally, entity resolution. We use matchings in random geometric graphs to derive minimax lower bounds for this problem that hold under great generality. Using these results we show that for a broad class of distributions, the order of the number of mistakes made by an estimator that minimizes the sum of squared Euclidean distances is minimax optimal when $d$ is fixed and is optimal up to $n^{o(1)}$ factors when $d = o(\log n)$. In the high-dimensional regime we consider a setup where both initial positions and perturbations have independent sub-Gaussian coordinates. In this setup we give sufficient conditions under which the same estimator makes no mistakes with high probability. We prove an analogous result for an adapted version of this estimator that incorporates information on the covariance matrix of the perturbations.
comment: 36 pages, 2 figures
☆ Disambiguate Entity Matching through Relation Discovery with Large Language Models
Entity matching is a critical challenge in data integration and cleaning, central to tasks like fuzzy joins and deduplication. Traditional approaches have focused on overcoming fuzzy term representations through methods such as edit distance, Jaccard similarity, and more recently, embeddings and deep neural networks, including advancements from large language models (LLMs) like GPT. However, the core challenge in entity matching extends beyond term fuzziness to the ambiguity in defining what constitutes a "match," especially when integrating with external databases. This ambiguity arises due to varying levels of detail and granularity among entities, complicating exact matches. We propose a novel approach that shifts focus from purely identifying semantic similarities to understanding and defining the "relations" between entities as crucial for resolving ambiguities in matching. By predefining a set of relations relevant to the task at hand, our method allows analysts to navigate the spectrum of similarity more effectively, from exact matches to conceptually related entities.
♻ ☆ A Decade of Scholarly Research on Open Knowledge Graphs LREC
The proliferation of open knowledge graphs has led to a surge in scholarly research on the topic over the past decade. This paper presents a bibliometric analysis of the scholarly literature on open knowledge graphs published between 2013 and 2023. The study aims to identify the trends, patterns, and impact of research in this field, as well as the key topics and research questions that have emerged. The work uses bibliometric techniques to analyze a sample of 4445 scholarly articles retrieved from Scopus. The findings reveal an ever-increasing number of publications on open knowledge graphs published every year, particularly in developed countries (+50 per year). These outputs are published in highly-referred scholarly journals and conferences. The study identifies three main research themes: (1) knowledge graph construction and enrichment, (2) evaluation and reuse, and (3) fusion of knowledge graphs into NLP systems. Within these themes, the study identifies specific tasks that have received considerable attention, including entity linking, knowledge graph embedding, and graph neural networks.
comment: Camera-ready edition for LREC-COLING 2024
♻ ☆ Unveiling the Pitfalls of Knowledge Editing for Large Language Models ICLR 2024
As the cost associated with fine-tuning Large Language Models (LLMs) continues to rise, recent research efforts have pivoted towards developing methodologies to edit implicit knowledge embedded within LLMs. Yet, there's still a dark cloud lingering overhead -- will knowledge editing trigger butterfly effect? since it is still unclear whether knowledge editing might introduce side effects that pose potential risks or not. This paper pioneers the investigation into the potential pitfalls associated with knowledge editing for LLMs. To achieve this, we introduce new benchmark datasets and propose innovative evaluation metrics. Our results underline two pivotal concerns: (1) Knowledge Conflict: Editing groups of facts that logically clash can magnify the inherent inconsistencies in LLMs-a facet neglected by previous methods. (2) Knowledge Distortion: Altering parameters with the aim of editing factual knowledge can irrevocably warp the innate knowledge structure of LLMs. Experimental results vividly demonstrate that knowledge editing might inadvertently cast a shadow of unintended consequences on LLMs, which warrant attention and efforts for future works. Code and data are available at https://github.com/zjunlp/PitfallsKnowledgeEditing.
comment: ICLR 2024
♻ ☆ ByteCard: Enhancing ByteDance's Data Warehouse with Learned Cardinality Estimation
Cardinality estimation is a critical component and a longstanding challenge in modern data warehouses. ByteHouse, ByteDance's cloud-native engine for big data analysis in exabyte-scale environments, serves numerous internal decision-making business scenarios. With the increasing demand of ByteHouse, cardinality estimation becomes the bottleneck for efficiently processing queries. Specifically, the existing query optimizer of ByteHouse uses the traditional Selinger-like cardinality estimator, which can produce huge estimation errors, resulting in sub-optimal query plans. To improve cardinality estimation accuracy while maintaining a practical inference overhead, we develop ByteCard framework that enables efficient training/updating and integration of cardinality estimators. Furthermore, ByteCard adapts recent advances in cardinality estimation to build models that can balance accuracy and practicality (e.g., inference latency, model size, training/updating overhead). We observe significant query processing speed-up in ByteHouse after replacing the system's existing cardinality estimation with ByteCard's estimations for several optimization strategies. Evaluations on real-world datasets show the integration of ByteCard leads to an improvement of up to 30% in the 99th quantile of latency. At last, we share our valuable experience in engineering advanced cardinality estimators. We believe this experience can help other data warehouses integrate more accurate and sophisticated solutions on the critical path of query execution.
♻ ☆ A proof-of-concept online metadata catalogue service of Earth observation datasets for human health research in exposomics
This article describes research carried out during 2023 under an International Society for Photogrammetry and Remote Sensing (ISPRS)-funded project to develop and disseminate a metadata catalogue of Earth observation data sources/products and types that are relevant to human health research in exposomics, as a free service to interested researchers worldwide. The proof-of-concept catalogue was informed by input from existing research literature on the subject (desk research), as well as online communications with, and relevant research publications collected from, a small panel (n = 5) of select experts from the academia in three countries (China, UK and USA). It has 90 metadata records of relevant Earth observation datasets (n = 40) and associated health-focused research publications (n = 50). The project's online portal offers a searchable version of the catalogue featuring a number of search modes and filtering options. It is hoped future, more comprehensive versions of this service will enable more researchers and studies to discover and use remote sensing data about population-level exposures to disease determinants (exposomic determinants of disease) in combination with other relevant data to reveal fresh insights that could improve our understanding of relevant diseases, and hence contribute to the development of better-optimized prevention and management plans to tackle them.
comment: 5 figures
Emerging Technologies
☆ Fermihedral: On the Optimal Compilation for Fermion-to-Qubit Encoding
This paper introduces Fermihedral, a compiler framework focusing on discovering the optimal Fermion-to-qubit encoding for targeted Fermionic Hamiltonians. Fermion-to-qubit encoding is a crucial step in harnessing quantum computing for efficient simulation of Fermionic quantum systems. Utilizing Pauli algebra, Fermihedral redefines complex constraints and objectives of Fermion-to-qubit encoding into a Boolean Satisfiability problem which can then be solved with high-performance solvers. To accommodate larger-scale scenarios, this paper proposed two new strategies that yield approximate optimal solutions mitigating the overhead from the exponentially large number of clauses. Evaluation across diverse Fermionic systems highlights the superiority of Fermihedral, showcasing substantial reductions in implementation costs, gate counts, and circuit depth in the compiled circuits. Real-system experiments on IonQ's device affirm its effectiveness, notably enhancing simulation accuracy.
☆ Analysis on reservoir activation with the nonlinearity harnessed from solution-processed MoS2 devices
Reservoir computing is a recurrent neural network that has been applied across various domains in machine learning. The implementation of reservoir computing, however, often demands heavy computations for activating the reservoir. Configuring physical reservoir networks and harnessing the nonlinearity from the underlying devices for activation is an emergent solution to address the computational challenge. Herein, we analyze the feasibility of employing the nonlinearity from solution-processed molybdenum disulfide (MoS2) devices for reservoir activation. The devices, fabricated using liquid-phase exfoliated MoS2, exhibit a high-order nonlinearity achieved by Stark modulation of the MoS2 material. We demonstrate that this nonlinearity can be fitted and employed as the activation function to facilitate reservoir computing implementation. Notably, owing to the high-order nonlinearity, the network exhibits long-term synchronization and robust generalization abilities for approximating complex dynamical systems. Given the remarkable reservoir activation capability, coupled with the scalability of the device fabrication, our findings open the possibility for the physical realization of lightweight, efficient reservoir computing for, for instance, signal classification, motion tracking, and pattern recognition of complex time series as well as secure cryptography. As an example, we show the network can be appointed to generate chaotic random numbers for secure data encryption.
☆ Towards a Dutch hybrid quantum/HPC infrastructure
Quantum Inspire has taken important steps to enable quantum applications by developing a setting that allows the execution of hybrid algorithms. Currently, the setting uses a classical server (HPC node) co-located with the quantum computer for the high frequency coupling needed by hybrid algorithms. A fast task manager (dispatcher) has been developed to orchestrate the interaction between the server and the quantum computer. Although successful, the setting imposes a specific hybrid job-structure. This is most likely always going to be the case and we are currently discussing how to make sure this does not hamper the uptake of the setting. Furthermore, first steps have been taken towards the integration with the Dutch National High-Performance Computing (HPC) Center, hosted by SURF. As a first approach we have setup a setting consisting of two SLURM clusters, one in the HPC (C1) and the second (C2) co-located with Quantum Inspire API. Jobs are submitted from C1 to C2. Quantum Inspire can then schedule with C2 the jobs to the quantum computer. With this setting, we enable control from both SURF and Quantum Inspire on the jobs being executed. By using C1 for the jobs submission we remove the accounting burden from Quantum Inspire. By having C2 co-located with Quantum Inspire API, we make the setting more resilient towards network failures. This setting can be extended for other HPC centers to submit jobs to Quantum Inspire backends.
comment: 6 pages, 3 figures
♻ ☆ Large Language Model for Multi-objective Evolutionary Optimization
Multiobjective evolutionary algorithms (MOEAs) are major methods for solving multiobjective optimization problems (MOPs). Many MOEAs have been proposed in the past decades, of which the search operators need a carefully handcrafted design with domain knowledge. Recently, some attempts have been made to replace the manually designed operators in MOEAs with learning-based operators (e.g., neural network models). However, much effort is still required for designing and training such models, and the learned operators might not generalize well on new problems. To tackle the above challenges, this work investigates a novel approach that leverages the powerful large language model (LLM) to design MOEA operators. With proper prompt engineering, we successfully let a general LLM serve as a black-box search operator for decomposition-based MOEA (MOEA/D) in a zero-shot manner. In addition, by learning from the LLM behavior, we further design an explicit white-box operator with randomness and propose a new version of decomposition-based MOEA, termed MOEA/D-LO. Experimental studies on different test benchmarks show that our proposed method can achieve competitive performance with widely used MOEAs. It is also promising to see the operator only learned from a few instances can have robust generalization performance on unseen problems with quite different patterns and settings. The results reveal the potential benefits of using pre-trained LLMs in the design of MOEAs.To foster reproducibility and accessibility, the source code is https://github.com/FeiLiu36/LLM4MOEA.
♻ ☆ Electron-Tunnelling-Noise Programmable Random Variate Accelerator for Monte Carlo Sampling
This article presents an electron tunneling noise programmable random variate accelerator for accelerating the sampling stage of Monte Carlo simulations. We used the LiteX framework to generate a FemtoRV imfc RISC-V instruction set soft processor and deploy it on a Digilent Arty-100T FPGA development board. The RISC-V soft processor augmented with our programmable random variate accelerator achieves an average speedup of 8.70 times and a median speedup of 8.68 times for a suite of twelve different benchmark applications when compared to GNU Scientific Library software random number generation. These speedups are achievable because the benchmarks spend an average of 90.0 % of their execution time generating random samples. The results of the Monte Carlo benchmark programs run over the programmable random variate accelerator have an average Wasserstein distance of 1.48 times and a median Wasserstein distance of 1.41 times$that of the results produced by the GNU Scientific Library random number generators. The soft processor samples the electron tunneling noise source using the hardened XADC block in the FPGA. The flexibility of the LiteX framework allows for the deployment of any LiteX-supported soft processor with an electron tunneling noise programmable random variate accelerator on any LiteX-supported development board that contains an FPGA with an XADC.
Human-Computer Interaction
☆ Instantaneous Visual Analysis of Blood Flow in Stenoses Using Morphological Similarity
The emergence of computational fluid dynamics (CFD) enabled the simulation of intricate transport processes, including flow in physiological structures, such as blood vessels. While these so-called hemodynamic simulations offer groundbreaking opportunities to solve problems at the clinical forefront, a successful translation of CFD to clinical decision-making is challenging. Hemodynamic simulations are intrinsically complex, time-consuming, and resource-intensive, which conflicts with the time-sensitive nature of clinical workflows and the fact that hospitals usually do not have the necessary resources or infrastructure to support CFD simulations. To address these transfer challenges, we propose a novel visualization system which enables instant flow exploration without performing on-site simulation. To gain insights into the viability of the approach, we focus on hemodynamic simulations of the carotid bifurcation, which is a highly relevant arterial subtree in stroke diagnostics and prevention. We created an initial database of 120 high-resolution carotid bifurcation flow models and developed a set of similarity metrics used to place a new carotid surface model into a neighborhood of simulated cases with the highest geometric similarity. The neighborhood can be immediately explored and the flow fields analyzed. We found that if the artery models are similar enough in the regions of interest, a new simulation leads to coinciding results, allowing the user to circumvent individual flow simulations. We conclude that similarity-based visual analysis is a promising approach toward the usability of CFD in medical practice.
comment: 13 pages, Eurographics Conference on Visualization (EuroVis) 2024
☆ Virtual Co-Pilot: Multimodal Large Language Model-enabled Quick-access Procedures for Single Pilot Operations
Advancements in technology, pilot shortages, and cost pressures are driving a trend towards single-pilot and even remote operations in aviation. Considering the extensive workload and huge risks associated with single-pilot operations, the development of a Virtual Co-Pilot (V-CoP) is expected to be a potential way to ensure aviation safety. This study proposes a V-CoP concept and explores how humans and virtual assistants can effectively collaborate. A preliminary case study is conducted to explore a critical role of V-CoP, namely automated quick procedures searching, using the multimodal large language model (LLM). The LLM-enabled V-CoP integrates the pilot instruction and real-time cockpit instrumental data to prompt applicable aviation manuals and operation procedures. The results showed that the LLM-enabled V-CoP achieved high accuracy in situational analysis and effective retrieval of procedure information. The results showed that the LLM-enabled V-CoP achieved high accuracy in situational analysis (90.5%) and effective retrieval of procedure information (86.5%). The proposed V-CoP is expected to provide a foundation for future virtual intelligent assistant development, improve the performance of single pilots, and reduce the risk of human errors in aviation.
comment: 10 pages,7 figures
☆ Technical Development of a Semi-Autonomous Robotic Partition
This technical description details the design and engineering process of a semi-autonomous robotic partition. This robotic partition prototype was subsequently employed in a longer-term evaluation in-the-wild study conducted by the authors in a real-world office setting.
☆ Research Challenges for Adaptive Architecture: Empowering Occupants of Multi-Occupancy Buildings
This positional paper outlines our vision of 'adaptive architecture', which involves the integration of robotic technology to physically change an architectural space in supporting the changing needs of its occupants, in response to the CHI'24 workshop "HabiTech - Inhabiting Buildings, Data & Technology" call on "How do new technologies enable and empower the inhabitants of multi-occupancy buildings?". Specifically, while adaptive architecture holds promise for enhancing occupant satisfaction, comfort, and overall health and well-being, there remains a range of research challenges of (1) how it can effectively support individual occupants, while (2) mediating the conflicting needs of collocated others, and (3) integrating meaningfully into the sociocultural characteristics of their building community.
☆ The Adaptive Workplace: Orchestrating Architectural Services around the Wellbeing of Individual Occupants
As the academic consortia members of the EU Horizon project SONATA ("Situation-aware OrchestratioN of AdapTive Architecture"), we respond to the workshop call for "Office Wellbeing by Design: Don't Stand for Anything Less" by proposing the "Adaptive Workplace" concept. In essence, our vision aims to adapt a workplace to the ever-changing needs of individual occupants, instead of that occupants are expected to adapt to their workplace.
☆ Enhancing Cross-Dataset EEG Emotion Recognition: A Novel Approach with Emotional EEG Style Transfer Network
Recognizing the pivotal role of EEG emotion recognition in the development of affective Brain-Computer Interfaces (aBCIs), considerable research efforts have been dedicated to this field. While prior methods have demonstrated success in intra-subject EEG emotion recognition, a critical challenge persists in addressing the style mismatch between EEG signals from the source domain (training data) and the target domain (test data). To tackle the significant inter-domain differences in cross-dataset EEG emotion recognition, this paper introduces an innovative solution known as the Emotional EEG Style Transfer Network (E$^2$STN). The primary objective of this network is to effectively capture content information from the source domain and the style characteristics from the target domain, enabling the reconstruction of stylized EEG emotion representations. These representations prove highly beneficial in enhancing cross-dataset discriminative prediction. Concretely, E$^2$STN consists of three key modules\textemdash transfer module, transfer evaluation module, and discriminative prediction module\textemdash which address the domain style transfer, transfer quality evaluation, and discriminative prediction, respectively. Extensive experiments demonstrate that E$^2$STN achieves state-of-the-art performance in cross-dataset EEG emotion recognition tasks.
comment: 8 pages. arXiv admin note: substantial text overlap with arXiv:2308.05767
☆ Linguistically Differentiating Acts and Recalls of Racial Microaggressions on Social Media
In this work, we examine the linguistic signature of online racial microaggressions (acts) and how it differs from that of personal narratives recalling experiences of such aggressions (recalls) by Black social media users. We manually curate and annotate a corpus of acts and recalls from in-the-wild social media discussions, and verify labels with Black workshop participants. We leverage Natural Language Processing (NLP) and qualitative analysis on this data to classify (RQ1), interpret (RQ2), and characterize (RQ3) the language underlying acts and recalls of racial microaggressions in the context of racism in the U.S. Our findings show that neural language models (LMs) can classify acts and recalls with high accuracy (RQ1) with contextual words revealing themes that associate Blacks with objects that reify negative stereotypes (RQ2). Furthermore, overlapping linguistic signatures between acts and recalls serve functionally different purposes (RQ3), providing broader implications to the current challenges in content moderation systems on social media.
comment: 36 pages
☆ A Mixed Method Study of DevOps Challenges
Context: DevOps practices combine software development and IT operations. There is a growing number of DevOps related posts in popular online developer forum Stack Overflow (SO). While previous research analyzed SO posts related to build/release engineering, we are aware of no research that specifically focused on DevOps related discussions. Objective: To learn the challenges developers face while using the currently available DevOps tools and techniques along with the organizational challenges in DevOps practices. Method: We conduct an empirical study by applying topic modeling on 174K SO posts that contain DevOps discussions. We then validate and extend the empirical study findings with a survey of 21 professional DevOps practitioners. Results: We find that: (1) There are 23 DevOps topics grouped into four categories: Cloud & CI/CD Tools, Infrastructure as Code, Container & Orchestration, and Quality Assurance. (2) The topic category Cloud & CI/CD Tools contains the highest number of topics (10) which cover 48.6% of all questions in our dataset, followed by the category Infrastructure as Code (28.9%). (3) File management is the most popular topic followed by Jenkins Pipeline, while infrastructural Exception Handling and Jenkins Distributed Architecture are the most difficult topics (with least accepted answers). (4) In the survey, developers mention that it requires hands-on experience before current DevOps tools can be considered easy. They raised the needs for better documentation and learning resources to learn the rapidly changing DevOps tools and techniques. Practitioners also emphasized on the formal training approach by the organizations for DevOps skill development. Conclusion: Architects and managers can use the findings of this research to adopt appropriate DevOps technologies, and organizations can design tool or process specific DevOps training programs.
☆ Development of a Chinese Human-Automation Trust Scale
The development of a reliable and valid assessment tool of human-automation trust is an important topic. This study aimed to develop a Chinese version of human-automation trust scale (C-HATS) with reasonable reliability and validity based on Lee and See (2004)'s trust model. After three phases of assessments including exploratory factor analysis, item analysis, and confirmatory factor analysis, different dimensions and items were considered for initial and posttask human-automation trust. For post-task trust, the scale had three dimensions and 11 items and reflected Lee and See (2004)'s model, whereas different from Lee and See (2004)'s model, the final scale had 14 items but only two dimensions for initial trust. Nevertheless, for both initial and post-task trust, reasonable reliability and validity of the scale were verified with various consumer automation products. Although further verification is still necessary, the developed C-HATS could be used to effectively assess human-automation trust in the Chinese context.
comment: 26 pages with 3 figures
☆ Human Stress Response and Perceived Safety during Encounters with Quadruped Robots
Despite the rise of mobile robot deployments in home and work settings, perceived safety of users and bystanders is understudied in the human-robot interaction (HRI) literature. To address this, we present a study designed to identify elements of a human-robot encounter that correlate with observed stress response. Stress is a key component of perceived safety and is strongly associated with human physiological response. In this study a Boston Dynamics Spot and a Unitree Go1 navigate autonomously through a shared environment occupied by human participants wearing multimodal physiological sensors to track their electrocardiography (ECG) and electrodermal activity (EDA). The encounters are varied through several trials and participants self-rate their stress levels after each encounter. The study resulted in a multidimensional dataset archiving various objective and subjective aspects of a human-robot encounter, containing insights for understanding perceived safety in such encounters. To this end, acute stress responses were decoded from the human participants' ECG and EDA and compared across different human-robot encounter conditions. Statistical analysis of data indicate that on average (1) participants feel more stress during encounters compared to baselines, (2) participants feel more stress encountering multiple robots compared to a single robot and (3) participants stress increases during navigation behavior compared with search behavior.
comment: 7 pages, 7 figs, 5 tables
☆ EXPLORA: A teacher-apprentice methodology for eliciting natural child-computer interactions
Investigating child-computer interactions within their contexts is vital for designing technology that caters to children's needs. However, determining what aspects of context are relevant for designing child-centric technology remains a challenge. We introduce EXPLORA, a multimodal, multistage online methodology comprising three pivotal stages: (1) building a teacher-apprentice relationship,(2) learning from child-teachers, and (3) assessing and reinforcing researcher-apprentice learning. Central to EXPLORA is the collection of attitudinal data through pre-observation interviews, offering researchers a deeper understanding of children's characteristics and contexts. This informs subsequent online observations, allowing researchers to focus on frequent interactions. Furthermore, researchers can validate preliminary assumptions with children. A means-ends analysis framework aids in the systematic analysis of data, shedding light on context, agency and homework-information searching processes children employ in their activities. To illustrate EXPLORA's capabilities, we present nine single case studies investigating Brazilian child-caregiver dyads' (children ages 9-11) use of technology in homework information-searching.
☆ Measuring Compliance with the California Consumer Privacy Act Over Space and Time
The widespread sharing of consumers personal information with third parties raises significant privacy concerns. The California Consumer Privacy Act (CCPA) mandates that online businesses offer consumers the option to opt out of the sale and sharing of personal information. Our study automatically tracks the presence of the opt-out link longitudinally across multiple states after the California Privacy Rights Act (CPRA) went into effect. We categorize websites based on whether they are subject to CCPA and investigate cases of potential non-compliance. We find a number of websites that implement the opt-out link early and across all examined states but also find a significant number of CCPA-subject websites that fail to offer any opt-out methods even when CCPA is in effect. Our findings can shed light on how websites are reacting to the CCPA and identify potential gaps in compliance and opt- out method designs that hinder consumers from exercising CCPA opt-out rights.
☆ SeSaMe: A Framework to Simulate Self-Reported Ground Truth for Mental Health Sensing Studies
Advances in mobile and wearable technologies have enabled the potential to passively monitor a person's mental, behavioral, and affective health. These approaches typically rely on longitudinal collection of self-reported outcomes, e.g., depression, stress, and anxiety, to train machine learning (ML) models. However, the need to continuously self-report adds a significant burden on the participants, often resulting in attrition, missing labels, or insincere responses. In this work, we introduce the Scale Scores Simulation using Mental Models (SeSaMe) framework to alleviate participants' burden in digital mental health studies. By leveraging pre-trained large language models (LLMs), SeSaMe enables the simulation of participants' responses on psychological scales. In SeSaMe, researchers can prompt LLMs with information on participants' internal behavioral dispositions, enabling LLMs to construct mental models of participants to simulate their responses on psychological scales. We demonstrate an application of SeSaMe, where we use GPT-4 to simulate responses on one scale using responses from another as behavioral information. We also evaluate the alignment between human and SeSaMe-simulated responses to psychological scales. Then, we present experiments to inspect the utility of SeSaMe-simulated responses as ground truth in training ML models by replicating established depression and anxiety screening tasks from a previous study. Our results indicate SeSaMe to be a promising approach, but its alignment may vary across scales and specific prediction objectives. We also observed that model performance with simulated data was on par with using the real data for training in most evaluation scenarios. We conclude by discussing the potential implications of SeSaMe in addressing some challenges researchers face with ground-truth collection in passive sensing studies.
☆ Building an Open-Source Community to Enhance Autonomic Nervous System Signal Analysis: DBDP-Autonomic
Smartphones and wearable sensors offer an unprecedented ability to collect peripheral psychophysiological signals across diverse timescales, settings, populations, and modalities. However, open-source software development has yet to keep pace with rapid advancements in hardware technology and availability, creating an analytical barrier that limits the scientific usefulness of acquired data. We propose a community-driven, open-source peripheral psychophysiological signal pre-processing and analysis software framework that could advance biobehavioral health by enabling more robust, transparent, and reproducible inferences involving autonomic nervous system data.
☆ Behind the Counter: Exploring the Motivations and Barriers of Online Counterspeech Writing
Current research mainly explores the attributes and impact of online counterspeech, leaving a gap in understanding of who engages in online counterspeech or what motivates or deters users from participating. To investigate this, we surveyed 458 English-speaking U.S. participants, analyzing key motivations and barriers underlying online counterspeech engagement. We presented each participant with three hate speech examples from a set of 900, spanning race, gender, religion, sexual orientation, and disability, and requested counterspeech responses. Subsequent questions assessed their satisfaction, perceived difficulty, and the effectiveness of their counterspeech. Our findings show that having been a target of online hate is a key driver of frequent online counterspeech engagement. People differ in their motivations and barriers towards engaging in online counterspeech across different demographic groups. Younger individuals, women, those with higher education levels, and regular witnesses to online hate are more reluctant to engage in online counterspeech due to concerns around public exposure, retaliation, and third-party harassment. Varying motivation and barriers in counterspeech engagement also shape how individuals view their own self-authored counterspeech and the difficulty experienced writing it. Additionally, our work explores people's willingness to use AI technologies like ChatGPT for counterspeech writing. Through this work we introduce a multi-item scale for understanding counterspeech motivation and barriers and a more nuanced understanding of the factors shaping online counterspeech engagement.
comment: 39 pages, 2 figures
☆ GOLF: Goal-Oriented Long-term liFe tasks supported by human-AI collaboration
The advent of ChatGPT and similar large language models (LLMs) has revolutionized the human-AI interaction and information-seeking process. Leveraging LLMs as an alternative to search engines, users can now access summarized information tailored to their queries, significantly reducing the cognitive load associated with navigating vast information resources. This shift underscores the potential of LLMs in redefining information access paradigms. Drawing on the foundation of task-focused information retrieval and LLMs' task planning ability, this research extends the scope of LLM capabilities beyond routine task automation to support users in navigating long-term and significant life tasks. It introduces the GOLF framework (Goal-Oriented Long-term liFe tasks), which focuses on enhancing LLMs' ability to assist in significant life decisions through goal orientation and long-term planning. The methodology encompasses a comprehensive simulation study to test the framework's efficacy, followed by model and human evaluations to develop a dataset benchmark for long-term life tasks, and experiments across different models and settings. By shifting the focus from short-term tasks to the broader spectrum of long-term life goals, this research underscores the transformative potential of LLMs in enhancing human decision-making processes and task management, marking a significant step forward in the evolution of human-AI collaboration.
☆ "It is there, and you need it, so why do you not use it?" Achieving better adoption of AI systems by domain experts, in the case study of natural science research
Artificial Intelligence (AI) is becoming ubiquitous in domains such as medicine and natural science research. However, when AI systems are implemented in practice, domain experts often refuse them. Low acceptance hinders effective human-AI collaboration, even when it is essential for progress. In natural science research, scientists' ineffective use of AI-enabled systems can impede them from analysing their data and advancing their research. We conducted an ethnographically informed study of 10 in-depth interviews with AI practitioners and natural scientists at the organisation facing low adoption of algorithmic systems. Results were consolidated into recommendations for better AI adoption: i) actively supporting experts during the initial stages of system use, ii) communicating the capabilities of a system in a user-relevant way, and iii) following predefined collaboration rules. We discuss the broader implications of our findings and expand on how our proposed requirements could support practitioners and experts across domains.
☆ Compressing and Interpreting Word Embeddings with Latent Space Regularization and Interactive Semantics Probing
Word embedding, a high-dimensional (HD) numerical representation of words generated by machine learning models, has been used for different natural language processing tasks, e.g., translation between two languages. Recently, there has been an increasing trend of transforming the HD embeddings into a latent space (e.g., via autoencoders) for further tasks, exploiting various merits the latent representations could bring. To preserve the embeddings' quality, these works often map the embeddings into an even higher-dimensional latent space, making the already complicated embeddings even less interpretable and consuming more storage space. In this work, we borrow the idea of $\beta$VAE to regularize the HD latent space. Our regularization implicitly condenses information from the HD latent space into a much lower-dimensional space, thus compressing the embeddings. We also show that each dimension of our regularized latent space is more semantically salient, and validate our assertion by interactively probing the encoding-level of user-proposed semantics in the dimensions. To the end, we design a visual analytics system to monitor the regularization process, explore the HD latent space, and interpret latent dimensions' semantics. We validate the effectiveness of our embedding regularization and interpretation approach through both quantitative and qualitative evaluations.
☆ Towards Human-AI Deliberation: Design and Evaluation of LLM-Empowered Deliberative AI for AI-Assisted Decision-Making
In AI-assisted decision-making, humans often passively review AI's suggestion and decide whether to accept or reject it as a whole. In such a paradigm, humans are found to rarely trigger analytical thinking and face difficulties in communicating the nuances of conflicting opinions to the AI when disagreements occur. To tackle this challenge, we propose Human-AI Deliberation, a novel framework to promote human reflection and discussion on conflicting human-AI opinions in decision-making. Based on theories in human deliberation, this framework engages humans and AI in dimension-level opinion elicitation, deliberative discussion, and decision updates. To empower AI with deliberative capabilities, we designed Deliberative AI, which leverages large language models (LLMs) as a bridge between humans and domain-specific models to enable flexible conversational interactions and faithful information provision. An exploratory evaluation on a graduate admissions task shows that Deliberative AI outperforms conventional explainable AI (XAI) assistants in improving humans' appropriate reliance and task performance. Based on a mixed-methods analysis of participant behavior, perception, user experience, and open-ended feedback, we draw implications for future AI-assisted decision tool design.
☆ "We Have No Idea How Models will Behave in Production until Production": How Engineers Operationalize Machine Learning
Organizations rely on machine learning engineers (MLEs) to deploy models and maintain ML pipelines in production. Due to models' extensive reliance on fresh data, the operationalization of machine learning, or MLOps, requires MLEs to have proficiency in data science and engineering. When considered holistically, the job seems staggering -- how do MLEs do MLOps, and what are their unaddressed challenges? To address these questions, we conducted semi-structured ethnographic interviews with 18 MLEs working on various applications, including chatbots, autonomous vehicles, and finance. We find that MLEs engage in a workflow of (i) data preparation, (ii) experimentation, (iii) evaluation throughout a multi-staged deployment, and (iv) continual monitoring and response. Throughout this workflow, MLEs collaborate extensively with data scientists, product stakeholders, and one another, supplementing routine verbal exchanges with communication tools ranging from Slack to organization-wide ticketing and reporting systems. We introduce the 3Vs of MLOps: velocity, visibility, and versioning -- three virtues of successful ML deployments that MLEs learn to balance and grow as they mature. Finally, we discuss design implications and opportunities for future work.
comment: arXiv admin note: text overlap with arXiv:2209.09125
♻ ☆ Towards Massive Interaction with Generalist Robotics: A Systematic Review of XR-enabled Remote Human-Robot Interaction Systems
This survey provides an exhaustive review of the applications of extended reality (XR) technologies in the field of remote human-computer interaction (HRI). We developed a systematic search strategy based on the PRISMA methodology. From the initial 2,561 articles selected, 100 research papers that met our inclusion criteria were included. We categorized and summarized the domain in detail, delving into XR technologies, including augmented reality (AR), virtual reality (VR), and mixed reality (MR), and their applications in facilitating intuitive and effective remote control and interaction with robotic systems.The survey highlights existing articles on the application of XR technologies, user experience enhancement, and various interaction designs for XR in remote HRI, providing insights into current trends and future directions. We also identified potential gaps and opportunities for future research to improve remote HRI systems through XR technology to guide and inform future XR and robotics research.
♻ ☆ From Words and Exercises to Wellness: Farsi Chatbot for Self-Attachment Technique
In the wake of the post-pandemic era, marked by social isolation and surging rates of depression and anxiety, conversational agents based on digital psychotherapy can play an influential role compared to traditional therapy sessions. In this work, we develop a voice-capable chatbot in Farsi to guide users through Self-Attachment (SAT), a novel, self-administered, holistic psychological technique based on attachment theory. Our chatbot uses a dynamic array of rule-based and classification-based modules to comprehend user input throughout the conversation and navigates a dialogue flowchart accordingly, recommending appropriate SAT exercises that depend on the user's emotional and mental state. In particular, we collect a dataset of over 6,000 utterances and develop a novel sentiment-analysis module that classifies user sentiment into 12 classes, with accuracy above 92%. To keep the conversation novel and engaging, the chatbot's responses are retrieved from a large dataset of utterances created with the aid of Farsi GPT-2 and a reinforcement learning approach, thus requiring minimal human annotation. Our chatbot also offers a question-answering module, called SAT Teacher, to answer users' questions about the principles of Self-Attachment. Finally, we design a cross-platform application as the bot's user interface. We evaluate our platform in a ten-day human study with N=52 volunteers from the non-clinical population, who have had over 2,000 dialogues in total with the chatbot. The results indicate that the platform was engaging to most users (75%), 72% felt better after the interactions, and 74% were satisfied with the SAT Teacher's performance.
♻ ☆ Can AI and humans genuinely communicate?
Can AI and humans genuinely communicate? In this article, after giving some background and motivating my proposal (sections 1 to 3), I explore a way to answer this question that I call the "mental-behavioral methodology" (sections 4 and 5). This methodology follows the following three steps: First, spell out what mental capacities are sufficient for human communication (as opposed to communication more generally). Second, spell out the experimental paradigms required to test whether a behavior exhibits these capacities. Third, apply or adapt these paradigms to test whether an AI displays the relevant behaviors. If the first two steps are successfully completed, and if the AI passes the tests with human-like results, this constitutes evidence that this AI and humans can genuinely communicate. This mental-behavioral methodology has the advantage that we don't need to understand the workings of black-box algorithms, such as standard deep neural networks. This is comparable to the fact that we don't need to understand how human brains work to know that humans can genuinely communicate. This methodology also has its disadvantages and I will discuss some of them (section 6).
comment: March 2024 preprint
♻ ☆ Inferring Human Intentions from Predicted Action Probabilities CHI 2024
Predicting the next action that a human is most likely to perform is key to human-AI collaboration and has consequently attracted increasing research interests in recent years. An important factor for next action prediction are human intentions: If the AI agent knows the intention it can predict future actions and plan collaboration more effectively. Existing Bayesian methods for this task struggle with complex visual input while deep neural network (DNN) based methods do not provide uncertainty quantifications. In this work we combine both approaches for the first time and show that the predicted next action probabilities contain information that can be used to infer the underlying intention. We propose a two-step approach to human intention prediction: While a DNN predicts the probabilities of the next action, MCMC-based Bayesian inference is used to infer the underlying intention from these predictions. This approach not only allows for independent design of the DNN architecture but also the subsequently fast, design-independent inference of human intentions. We evaluate our method using a series of experiments on the Watch-And-Help (WAH) and a keyboard and mouse interaction dataset. Our results show that our approach can accurately predict human intentions from observed actions and the implicit information contained in next action probabilities. Furthermore, we show that our approach can predict the correct intention even if only few actions have been observed.
comment: Accepted by Workshop on Theory of Mind in Human-AI Interaction at CHI 2024
♻ ☆ Mind Meets Robots: A Review of EEG-Based Brain-Robot Interaction Systems
Brain-robot interaction (BRI) empowers individuals to control (semi-)automated machines through their brain activity, either passively or actively. In the past decade, BRI systems have achieved remarkable success, predominantly harnessing electroencephalogram (EEG) signals as the central component. This paper offers an up-to-date and exhaustive examination of 87 curated studies published during the last five years (2018-2023), focusing on identifying the research landscape of EEG-based BRI systems. This review aims to consolidate and underscore methodologies, interaction modes, application contexts, system evaluation, existing challenges, and potential avenues for future investigations in this domain. Based on our analysis, we present a BRI system model with three entities: Brain, Robot, and Interaction, depicting the internal relationships of a BRI system. We especially investigate the essence and principles on interaction modes between human brains and robots, a domain that has not yet been identified anywhere. We then discuss these entities with different dimensions encompassed. Within this model, we scrutinize and classify current research, reveal insights, specify challenges, and provide recommendations for future research trajectories in this field. Meanwhile, we envision our findings offer a design space for future human-robot interaction (HRI) research, informing the creation of efficient BRI frameworks.
♻ ☆ Mapping the Landscape of Independent Food Delivery Platforms in the United States SC
Beyond the well-known giants like Uber Eats and DoorDash, there are hundreds of independent food delivery platforms in the United States. However, little is known about the sociotechnical landscape of these ``indie'' platforms. In this paper, we analyzed these platforms to understand why they were created, how they operate, and what technologies they use. We collected data on 495 indie platforms and detailed survey responses from 29 platforms. We found that personalized, timely service is a central value of indie platforms, as is a sense of responsibility to the local community they serve. Indie platforms are motivated to provide fair rates for restaurants and couriers. These alternative business practices differentiate them from mainstream platforms. Though indie platforms have plans to expand, a lack of customizability in off-the-shelf software prevents independent platforms from personalizing services for their local communities. We show that these platforms are a widespread and longstanding fixture of the food delivery market. We illustrate the diversity of motivations and values to explain why a one-size-fits-all support is insufficient, and we discuss the siloing of technology that inhibits platforms' growth. Through these insights, we aim to promote future HCI research into the potential development of public-interest technologies for local food delivery.
comment: To appear in CSCW 2024
♻ ☆ Chart4Blind: An Intelligent Interface for Chart Accessibility Conversion
In a world driven by data visualization, ensuring the inclusive accessibility of charts for Blind and Visually Impaired (BVI) individuals remains a significant challenge. Charts are usually presented as raster graphics without textual and visual metadata needed for an equivalent exploration experience for BVI people. Additionally, converting these charts into accessible formats requires considerable effort from sighted individuals. Digitizing charts with metadata extraction is just one aspect of the issue; transforming it into accessible modalities, such as tactile graphics, presents another difficulty. To address these disparities, we propose Chart4Blind, an intelligent user interface that converts bitmap image representations of line charts into universally accessible formats. Chart4Blind achieves this transformation by generating Scalable Vector Graphics (SVG), Comma-Separated Values (CSV), and alternative text exports, all comply with established accessibility standards. Through interviews and a formal user study, we demonstrate that even inexperienced sighted users can make charts accessible in an average of 4 minutes using Chart4Blind, achieving a System Usability Scale rating of 90%. In comparison to existing approaches, Chart4Blind provides a comprehensive solution, generating end-to-end accessible SVGs suitable for assistive technologies such as embossed prints (papers and laser cut), 2D tactile displays, and screen readers. For additional information, including open-source codes and demos, please visit our project page https://moured.github.io/chart4blind/.
comment: Accepted to IUI 2024. 19 pages, 7 figures, 2 table. For a demo video, see this https://moured.github.io/chart4blind/ . The source code is available at https://github.com/moured/chart4blind_code/
♻ ☆ Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback ICLR 2024
Reinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without the need for costly manual reward design by aligning human preferences. It is crucial to consider diverse human feedback types and various learning methods in different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow from real human feedback, fostering progress in the development of practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF develops a user-friendly annotation interface tailored to various feedback types, compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotations, resulting in large-scale annotated datasets comprising more than 15 million steps across 30+ popular tasks. Through extensive experiments, the results in the collected datasets demonstrate competitive performance compared to those from well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We wish to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at https://uni-rlhf.github.io/.
comment: Published as a conference paper at ICLR 2024. The website is available at https://uni-rlhf.github.io/
Networking and Internet Architecture
☆ Symbolic and User-friendly Geometric Algebra Routines (SUGAR) for Computations in Matlab
Geometric algebra (GA) is a mathematical tool for geometric computing, providing a framework that allows a unified and compact approach to geometric relations which in other mathematical systems are typically described using different more complicated elements. This fact has led to an increasing adoption of GA in applied mathematics and engineering problems. However, the scarcity of symbolic implementations of GA and its inherent complexity, requiring a specific mathematical background, make it challenging and less intuitive for engineers to work with. This prevents wider adoption among more applied professionals. To address this challenge, this paper introduces SUGAR (Symbolic and User-friendly Geometric Algebra Routines), an open-source toolbox designed for Matlab and licensed under the MIT License. SUGAR facilitates the translation of GA concepts into Matlab and provides a collection of user-friendly functions tailored for GA computations, including support for symbolic operations. It supports both numeric and symbolic computations in high-dimensional GAs. Specifically tailored for applied mathematics and engineering applications, SUGAR has been meticulously engineered to represent geometric elements and transformations within two and three-dimensional projective and conformal geometric algebras, aligning with established computational methodologies in the literature. Furthermore, SUGAR efficiently handles functions of multivectors, such as exponential, logarithmic, sinusoidal, and cosine functions, enhancing its applicability across various engineering domains, including robotics, control systems, and power electronics. Finally, this work includes four distinct validation examples, demonstrating SUGAR's capabilities across the above-mentioned fields and its practical utility in addressing real-world applied mathematics and engineering problems.
comment: 33 pages, 6 figures, journal paper submitted to ACM TOMS
Social and Information Networks
☆ Who is bragging more online? A large scale analysis of bragging in social media LREC
Bragging is the act of uttering statements that are likely to be positively viewed by others and it is extensively employed in human communication with the aim to build a positive self-image of oneself. Social media is a natural platform for users to employ bragging in order to gain admiration, respect, attention and followers from their audiences. Yet, little is known about the scale of bragging online and its characteristics. This paper employs computational sociolinguistics methods to conduct the first large scale study of bragging behavior on Twitter (U.S.) by focusing on its overall prevalence, temporal dynamics and impact of demographic factors. Our study shows that the prevalence of bragging decreases over time within the same population of users. In addition, younger, more educated and popular users in the U.S. are more likely to brag. Finally, we conduct an extensive linguistics analysis to unveil specific bragging themes associated with different user traits.
comment: Accepted at LREC-COLING 2024
☆ Linguistically Differentiating Acts and Recalls of Racial Microaggressions on Social Media
In this work, we examine the linguistic signature of online racial microaggressions (acts) and how it differs from that of personal narratives recalling experiences of such aggressions (recalls) by Black social media users. We manually curate and annotate a corpus of acts and recalls from in-the-wild social media discussions, and verify labels with Black workshop participants. We leverage Natural Language Processing (NLP) and qualitative analysis on this data to classify (RQ1), interpret (RQ2), and characterize (RQ3) the language underlying acts and recalls of racial microaggressions in the context of racism in the U.S. Our findings show that neural language models (LMs) can classify acts and recalls with high accuracy (RQ1) with contextual words revealing themes that associate Blacks with objects that reify negative stereotypes (RQ2). Furthermore, overlapping linguistic signatures between acts and recalls serve functionally different purposes (RQ3), providing broader implications to the current challenges in content moderation systems on social media.
comment: 36 pages
☆ LSTTN: A Long-Short Term Transformer-based Spatio-temporal Neural Network for Traffic Flow Forecasting
Accurate traffic forecasting is a fundamental problem in intelligent transportation systems and learning long-range traffic representations with key information through spatiotemporal graph neural networks (STGNNs) is a basic assumption of current traffic flow prediction models. However, due to structural limitations, existing STGNNs can only utilize short-range traffic flow data; therefore, the models cannot adequately learn the complex trends and periodic features in traffic flow. Besides, it is challenging to extract the key temporal information from the long historical traffic series and obtain a compact representation. To solve the above problems, we propose a novel LSTTN (Long-Short Term Transformer-based Network) framework comprehensively considering the long- and short-term features in historical traffic flow. First, we employ a masked subseries Transformer to infer the content of masked subseries from a small portion of unmasked subseries and their temporal context in a pretraining manner, forcing the model to efficiently learn compressed and contextual subseries temporal representations from long historical series. Then, based on the learned representations, long-term trend is extracted by using stacked 1D dilated convolution layers, and periodic features are extracted by dynamic graph convolution layers. For the difficulties in making time-step level prediction, LSTTN adopts a short-term trend extractor to learn fine-grained short-term temporal features. Finally, LSTTN fuses the long-term trend, periodic features and short-term features to obtain the prediction results. Experiments on four real-world datasets show that in 60-minute-ahead long-term forecasting, the LSTTN model achieves a minimum improvement of 5.63\% and a maximum improvement of 16.78\% over baseline models. The source code is available at https://github.com/GeoX-Lab/LSTTN.
comment: 15 pages, 10 figures, 6 tables
☆ Diffusion-based Negative Sampling on Graphs for Link Prediction
Link prediction is a fundamental task for graph analysis with important applications on the Web, such as social network analysis and recommendation systems, etc. Modern graph link prediction methods often employ a contrastive approach to learn robust node representations, where negative sampling is pivotal. Typical negative sampling methods aim to retrieve hard examples based on either predefined heuristics or automatic adversarial approaches, which might be inflexible or difficult to control. Furthermore, in the context of link prediction, most previous methods sample negative nodes from existing substructures of the graph, missing out on potentially more optimal samples in the latent space. To address these issues, we investigate a novel strategy of multi-level negative sampling that enables negative node generation with flexible and controllable ``hardness'' levels from the latent space. Our method, called Conditional Diffusion-based Multi-level Negative Sampling (DMNS), leverages the Markov chain property of diffusion models to generate negative nodes in multiple levels of variable hardness and reconcile them for effective graph link prediction. We further demonstrate that DMNS follows the sub-linear positivity principle for robust negative sampling. Extensive experiments on several benchmark datasets demonstrate the effectiveness of DMNS.
comment: Accepted in the TheWebConf 2024
Review Ecosystems to access Educational XR Experiences: a Scoping Review
Educators, developers, and other stakeholders face challenges when creating, adapting, and utilizing virtual and augmented reality (XR) experiences for teaching curriculum topics. User created reviews of these applications provide important information about their relevance and effectiveness in supporting achievement of educational outcomes. To make these reviews accessible, relevant, and useful, they must be readily available and presented in a format that supports decision-making by educators. This paper identifies best practices for developing a new review ecosystem by analyzing existing approaches to providing reviews of interactive experiences. It focuses on the form and format of these reviews, as well as the mechanisms for sharing information about experiences and identifying which ones are most effective. The paper also examines the incentives that drive review creation and maintenance, ensuring that new experiences receive attention from reviewers and that relevant information is updated when necessary. The strategies and opportunities for developing an educational XR (eduXR) review ecosystem include methods for measuring properties such as quality metrics, engaging a broad range of stakeholders in the review process, and structuring the system as a closed loop managed by feedback and incentive structures to ensure stability and productivity. Computing educators are well-positioned to lead the development of these review ecosystems, which can relate XR experiences to the potential opportunities for teaching and learning that they offer.
comment: 29 pages, 13 figures
☆ Manufacturing Service Capability Prediction with Graph Neural Networks
In the current landscape, the predominant methods for identifying manufacturing capabilities from manufacturers rely heavily on keyword matching and semantic matching. However, these methods often fall short by either overlooking valuable hidden information or misinterpreting critical data. Consequently, such approaches result in an incomplete identification of manufacturers' capabilities. This underscores the pressing need for data-driven solutions to enhance the accuracy and completeness of manufacturing capability identification. To address the need, this study proposes a Graph Neural Network-based method for manufacturing service capability identification over a knowledge graph. To enhance the identification performance, this work introduces a novel approach that involves aggregating information from the graph nodes' neighborhoods as well as oversampling the graph data, which can be effectively applied across a wide range of practical scenarios. Evaluations conducted on a Manufacturing Service Knowledge Graph and subsequent ablation studies demonstrate the efficacy and robustness of the proposed approach. This study not only contributes a innovative method for inferring manufacturing service capabilities but also significantly augments the quality of Manufacturing Service Knowledge Graphs.
☆ Constant-selection evolutionary dynamics on weighted networks
The population structure often impacts evolutionary dynamics. In constant-selection evolutionary dynamics between two types, amplifiers of selection are networks that promote the fitter mutant to take over the entire population, and suppressors of selection do the opposite. It has been shown that most undirected and unweighted networks are amplifiers of selection under a common updating rule and initial condition. Here, we extensively investigate how edge weights influence selection on undirected networks. We show that random edge weights make small networks less amplifying than the corresponding unweighted networks in a majority of cases and also make them suppressors of selection (i.e., less amplifying than the complete graph, or equivalently, the Moran process) in many cases. Qualitatively, the same result holds true for larger empirical networks. These results suggest that amplifiers of selection are not as common for weighted networks as for unweighted counterparts.
☆ Belief Samples Are All You Need For Social Learning
In this paper, we consider the problem of social learning, where a group of agents embedded in a social network are interested in learning an underlying state of the world. Agents have incomplete, noisy, and heterogeneous sources of information, providing them with recurring private observations of the underlying state of the world. Agents can share their learning experience with their peers by taking actions observable to them, with values from a finite feasible set of states. Actions can be interpreted as samples from the beliefs which agents may form and update on what the true state of the world is. Sharing samples, in place of full beliefs, is motivated by the limited communication, cognitive, and information-processing resources available to agents especially in large populations. Previous work (Salhab et al.) poses the question as to whether learning with probability one is still achievable if agents are only allowed to communicate samples from their beliefs. We provide a definite positive answer to this question, assuming a strongly connected network and a ``collective distinguishability'' assumption, which are both required for learning even in full-belief-sharing settings. In our proposed belief update mechanism, each agent's belief is a normalized weighted geometric interpolation between a fully Bayesian private belief -- aggregating information from the private source -- and an ensemble of empirical distributions of the samples shared by her neighbors over time. By carefully constructing asymptotic almost-sure lower/upper bounds on the frequency of shared samples matching the true state/or not, we rigorously prove the convergence of all the beliefs to the true state, with probability one.
comment: 6 pages
☆ Dynamics of Affective Polarization: From Consensus to Partisan Divides
Politically divided societies are also often divided emotionally: people like and trust those with similar political views (in-group favoritism) while disliking and distrusting those with different views (out-group animosity). This phenomenon, called affective polarization, influences individual decisions, including seemingly apolitical choices such as whether to wear a mask or what car to buy. We present a dynamical model of decision-making in an affectively polarized society, identifying three potential global outcomes separated by a sharp boundary in the parameter space: consensus, partisan polarization, and non-partisan polarization. Analysis reveals that larger out-group animosity compared to in-group favoritism, i.e. more hate than love, is sufficient for polarization, while larger in-group favoritism compared to out-group animosity, i.e., more love than hate, is necessary for consensus. We also show that, counter-intuitively, increasing cross-party connections facilitates polarization, and that by emphasizing partisan differences, mass media creates self-fulfilling prophecies that lead to polarization. Affective polarization also creates tipping points in the opinion landscape where one group suddenly reverses their trends. Our findings aid in understanding and addressing the cascading effects of affective polarization, offering insights for strategies to mitigate polarization.
♻ ☆ On the resilience of Collaborative Learning-based Recommender Systems Against Community Detection Attack
Collaborative-learning-based recommender systems emerged following the success of collaborative learning techniques such as Federated Learning (FL) and Gossip Learning (GL). In these systems, users participate in the training of a recommender system while maintaining their history of consumed items on their devices. While these solutions seemed appealing for preserving the privacy of the participants at first glance, recent studies have revealed that collaborative learning can be vulnerable to various privacy attacks. In this paper, we study the resilience of collaborative learning-based recommender systems against a novel privacy attack called Community Detection Attack (CDA). This attack enables an adversary to identify community members based on a chosen set of items (eg., identifying users interested in specific points-of-interest). Through experiments on three real recommendation datasets using two state-of-the-art recommendation models, we evaluate the sensitivity of an FL-based recommender system as well as two flavors of Gossip Learning-based recommender systems to CDA. The results show that across all models and datasets, the FL setting is more vulnerable to CDA compared to Gossip settings. Furthermore, we assess two off-the-shelf mitigation strategies, namely differential privacy (DP) and a \emph{Share less} policy, which consists of sharing a subset of less sensitive model parameters. The findings indicate a more favorable privacy-utility trade-off for the \emph{Share less} strategy, particularly in FedRecs.
♻ ☆ Spectral clustering in the Gaussian mixture block model
Gaussian mixture block models are distributions over graphs that strive to model modern networks: to generate a graph from such a model, we associate each vertex $i$ with a latent feature vector $u_i \in \mathbb{R}^d$ sampled from a mixture of Gaussians, and we add edge $(i,j)$ if and only if the feature vectors are sufficiently similar, in that $\langle u_i,u_j \rangle \ge \tau$ for a pre-specified threshold $\tau$. The different components of the Gaussian mixture represent the fact that there may be different types of nodes with different distributions over features -- for example, in a social network each component represents the different attributes of a distinct community. Natural algorithmic tasks associated with these networks are embedding (recovering the latent feature vectors) and clustering (grouping nodes by their mixture component). In this paper we initiate the study of clustering and embedding graphs sampled from high-dimensional Gaussian mixture block models, where the dimension of the latent feature vectors $d\to \infty$ as the size of the network $n \to \infty$. This high-dimensional setting is most appropriate in the context of modern networks, in which we think of the latent feature space as being high-dimensional. We analyze the performance of canonical spectral clustering and embedding algorithms for such graphs in the case of 2-component spherical Gaussian mixtures, and begin to sketch out the information-computation landscape for clustering and embedding in these models.
comment: 48 pages
Distributed, Parallel, and Cluster Computing
☆ FOOL: Addressing the Downlink Bottleneck in Satellite Computing with Neural Feature Compression
Nanosatellite constellations equipped with sensors capturing large geographic regions provide unprecedented opportunities for Earth observation. As constellation sizes increase, network contention poses a downlink bottleneck. Orbital Edge Computing (OEC) leverages limited onboard compute resources to reduce transfer costs by processing the raw captures at the source. However, current solutions have limited practicability due to reliance on crude filtering methods or over-prioritizing particular downstream tasks. This work presents FOOL, an OEC-native and task-agnostic feature compression method that preserves prediction performance. FOOL partitions high-resolution satellite imagery to maximize throughput. Further, it embeds context and leverages inter-tile dependencies to lower transfer costs with negligible overhead. While FOOL is a feature compressor, it can recover images with competitive scores on perceptual quality measures at lower bitrates. We extensively evaluate transfer cost reduction by including the peculiarity of intermittently available network connections in low earth orbit. Lastly, we test the feasibility of our system for standardized nanosatellite form factors. We demonstrate that FOOL permits downlinking over 100x the data volume without relying on prior information on the downstream tasks.
comment: 18 pages, double column, 19 figures, 7 tables, Initial Submission to IEEE Transactions on Mobile Computing
☆ Accelerating Federated Learning by Selecting Beneficial Herd of Local Gradients
Federated Learning (FL) is a distributed machine learning framework in communication network systems. However, the systems' Non-Independent and Identically Distributed (Non-IID) data negatively affect the convergence efficiency of the global model, since only a subset of these data samples are beneficial for model convergence. In pursuit of this subset, a reliable approach involves determining a measure of validity to rank the samples within the dataset. In this paper, We propose the BHerd strategy which selects a beneficial herd of local gradients to accelerate the convergence of the FL model. Specifically, we map the distribution of the local dataset to the local gradients and use the Herding strategy to obtain a permutation of the set of gradients, where the more advanced gradients in the permutation are closer to the average of the set of gradients. These top portion of the gradients will be selected and sent to the server for global aggregation. We conduct experiments on different datasets, models and scenarios by building a prototype system, and experimental results demonstrate that our BHerd strategy is effective in selecting beneficial local gradients to mitigate the effects brought by the Non-IID dataset, thus accelerating model convergence.
☆ Differentially Private Online Federated Learning with Correlated Noise
We propose a novel differentially private algorithm for online federated learning that employs temporally correlated noise to improve the utility while ensuring the privacy of the continuously released models. To address challenges stemming from DP noise and local updates with streaming noniid data, we develop a perturbed iterate analysis to control the impact of the DP noise on the utility. Moreover, we demonstrate how the drift errors from local updates can be effectively managed under a quasi-strong convexity condition. Subject to an $(\epsilon, \delta)$-DP budget, we establish a dynamic regret bound over the entire time horizon that quantifies the impact of key parameters and the intensity of changes in dynamic environments. Numerical experiments validate the efficacy of the proposed algorithm.
comment: 11 pages
☆ ColonyOS -- A Meta-Operating System for Distributed Computing Across Heterogeneous Platform
This paper presents ColonyOS, an open-source meta-operating system designed to improve integration and utilization of diverse computing platforms, including IoT, edge, cloud, and HPC. Operating as an overlay, ColonyOS can interface with a wide range of computing environments, fostering creation of so-called compute continuums. This makes it possible to develop AI workflows and applications that can operate across platforms. At its core, ColonyOS consists of distributed executors that integrate with various underlying platforms based on a distributed microservice architecture. These executors collectively form a colony, serving as a unified computing unit. To enable secure integration of various platforms, each colony is provisioned with precisely the resources needed, and all communication is confined within the colony governed by a strict zero-trust security protocol. Interaction with ColonyOS is done by submitting functional meta-descriptions of computational tasks, called function specifications. These are sent to a Colonies server, which acts as intermediary between applications and the executors. Upon assignment, an executor interprets the meta-description and translates it into an executable format, e.g. a Kubernetes deployment description, a Slurm script, or a direct function call within the executor. Furthermore, a built-in meta-file system enables data synchronization directives to be included in meta-descriptions, enabling seamless data management across platforms. Ultimately, ColonyOS paves the way for development of hyper-distributed applications and workflows, which can seamlessly operate in a computing continuum. The paper describes design principles and implementation details of ColonyOS.
comment: See https://colonyos.io for more information
☆ FedAC: A Adaptive Clustered Federated Learning Framework for Heterogeneous Data
Clustered federated learning (CFL) is proposed to mitigate the performance deterioration stemming from data heterogeneity in federated learning (FL) by grouping similar clients for cluster-wise model training. However, current CFL methods struggle due to inadequate integration of global and intra-cluster knowledge and the absence of an efficient online model similarity metric, while treating the cluster count as a fixed hyperparameter limits flexibility and robustness. In this paper, we propose an adaptive CFL framework, named FedAC, which (1) efficiently integrates global knowledge into intra-cluster learning by decoupling neural networks and utilizing distinct aggregation methods for each submodule, significantly enhancing performance; (2) includes a costeffective online model similarity metric based on dimensionality reduction; (3) incorporates a cluster number fine-tuning module for improved adaptability and scalability in complex, heterogeneous environments. Extensive experiments show that FedAC achieves superior empirical performance, increasing the test accuracy by around 1.82% and 12.67% on CIFAR-10 and CIFAR-100 datasets, respectively, under different non-IID settings compared to SOTA methods.
comment: 14 pages, 4 figures
☆ Raptor: Distributed Scheduling for Serverless Functions
Serverless platforms that poorly schedule function requests inspire developers to implement workarounds to issues like high cold start latencies, poor fault tolerance, and limited support for parallel processing. These solutions litter environments with idle containers and add unnecessary pressure to the already underperforming scheduling services. An effective serverless scheduling policy should encourage developers to write small and reusable snippets of code, and give operators the freedom to administer cluster workloads however necessary in order to meet their operational demands. To this end, we have designed a distributed scheduling service that integrates with existing serverless frameworks. Our service addresses three key issues that affect modern serverless platforms; high cold start latencies, poor fault tolerance, and limited native support for parallel processing patterns like fork-join and map-reduce. We have built a prototype that integrates with the existing OpenWhisk services, and is fully backwards compatible with the existing implementation. The updated architecture improves performance and adds new scheduling and security features. Our empirical results demonstrate that our scheduler reduces cold start execution latencies by up to 80, steady state latencies by up to 10, and does so with negligible time and memory overhead.
☆ Distributed Simulation of Large Multi-body Systems
We present a technique designed for parallelizing large rigid body simulations, capable of exploiting multiple CPU cores within a computer and across a network. Our approach can be applied to simulate both unilateral and bilateral constraints, requiring straightforward modifications to the underlying physics engine. Starting from an approximate partitioning, we identify interface bodies and add them to overlapping sets such that they are simulated by multiple workers. At each timestep, we blend the states of overlap bodies using weights based on graph geodesic distances within the constraint graph. The use of overlap simulation also allows us to perform load balancing using efficient local evaluations of the constraint graph. We demonstrate our technique's scalability and load-balancing capabilities using several large-scale scenes.
comment: For associated video, see https://www.youtube.com/watch?v=2gg-YnIGJ-w
☆ Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models
With Dynamic Resource Management (DRM) the resources assigned to a job can be changed dynamically during its execution. From the system's perspective, DRM opens a new level of flexibility in resource allocation and job scheduling and therefore has the potential to improve system efficiency metrics such as the utilization rate, job throughput, energy efficiency, and responsiveness. From the application perspective, users can tailor the resources they request to their needs offering potential optimizations in queuing time or charged costs. Despite these obvious advantages and many attempts over the last decade to establish DRM in HPC, it remains a concept discussed in academia rather than being successfully deployed on production systems. This stems from the fact that support for DRM requires changes in all the layers of the HPC system software stack including applications, programming models, process managers, and resource management software, as well as an extensive and holistic co-design process to establish new techniques and policies for scheduling and resource optimization. In this work, we therefore start with the assumption that resources are accessible by processes executed either on them (e.g., on CPU) or controlling them (e.g., GPU-offloading). Then, the overall DRM problem can be decomposed into dynamic process management (DPM) and dynamic resource mapping or allocation (DRA). The former determines which processes (or which change in processes) must be managed and the latter identifies the resources where they will be executed. The interfaces for such \mbox{DPM/DPA} in these layers need to be standardized, which requires a careful design to be interoperable while providing high flexibility. Based on a survey of existing approaches we propose design principles, that form the basis of a holistic approach to DMR in HPC and provide a prototype implementation using MPI.
☆ A Unified CPU-GPU Protocol for GNN Training
Training a Graph Neural Network (GNN) model on large-scale graphs involves a high volume of data communication and compu- tations. While state-of-the-art CPUs and GPUs feature high computing power, the Standard GNN training protocol adopted in existing GNN frameworks cannot efficiently utilize the platform resources. To this end, we propose a novel Unified CPU-GPU protocol that can improve the resource utilization of GNN training on a CPU-GPU platform. The Unified CPU-GPU protocol instantiates multiple GNN training processes in parallel on both the CPU and the GPU. By allocating training processes on the CPU to perform GNN training collaboratively with the GPU, the proposed protocol improves the platform resource utilization and reduces the CPU-GPU data transfer overhead. Since the performance of a CPU and a GPU varies, we develop a novel load balancer that balances the workload dynamically between CPUs and GPUs during runtime. We evaluate our protocol using two representative GNN sampling algorithms, with two widely-used GNN models, on three datasets. Compared with the standard training protocol adopted in the state-of-the-art GNN frameworks, our protocol effectively improves resource utilization and overall training time. On a platform where the GPU moderately outperforms the CPU, our protocol speeds up GNN training by up to 1.41x. On a platform where the GPU significantly outperforms the CPU, our protocol speeds up GNN training by up to 1.26x. Our protocol is open-sourced and can be seamlessly integrated into state-of-the-art GNN frameworks and accelerate GNN training. Our protocol particularly benefits those with limited GPU access due to its high demand.
comment: To appear in 21st ACM International Conference on Computing Frontiers (CF' 24)
☆ Towards Secure and Trusted-by-Design Smart Contracts
Distributed immutable ledgers, or blockchains, allow the secure digitization of evidential transactions without relying on a trusted third-party. Evidential transactions involve the exchange of any form of physical evidence, such as money, birth certificate, visas, tickets, etc. Most of the time, evidential transactions occur in the context of complex procedures, called evidential protocols, among physical agents. The blockchain provides the mechanisms to transfer evidence, while smart contracts - programs executing within the blockchain in a decentralized and replicated fashion - allow encoding evidential protocols on top of a blockchain. As a smart contract foregoes trusted third-parties and runs on several machines anonymously, it constitutes a highly critical program that has to be secure and trusted-by-design. While most of the current smart contract languages focus on easy programmability, they do not directly address the need of guaranteeing trust and accountability, which becomes a significant issue when evidential protocols are encoded as smart contracts.
comment: 17 pages, 1 algorithm, The 29th Francophone Days of Application Languages - JFLA 2018
☆ Lessons Learned from Building Edge Software System Testbeds
Edge computing requires the complex software interaction of geo-distributed, heterogeneous components. The growing research and industry interest in edge computing software systems has necessitated exploring ways of testing and evaluating edge software at scale without relying on physical infrastructure. Beyond simulation, virtual testbeds that emulate edge infrastructure can provide a cost-efficient yet realistic environment to evaluate edge software. In this experience paper, we share lessons learned from building a total of five edge software testbeds. We describe pitfalls in architecture and development as well as experiences from having students use our testbed tooling in distributed systems prototyping classes. While we remain confident that building custom testbed tooling is the right approach for edge computing researchers and practitioners alike, we hope this paper allows others to avoid common mistakes and benefit from our experience.
☆ Union: An Automatic Workload Manager for Accelerating Network Simulation
With the rapid growth of the machine learning applications, the workloads of future HPC systems are anticipated to be a mix of scientific simulation, big data analytics, and machine learning applications. Simulation is a great research vehicle to understand the performance implications of co-running scientific applications with big data and machine learning workloads on large-scale systems. In this paper, we present Union, a workload manager that provides an automatic framework to facilitate hybrid workload simulation in CODES. Furthermore, we use Union, along with CODES, to investigate various hybrid workloads composed of traditional simulation applications and emerging learning applications on two dragonfly systems. The experiment results show that both message latency and communication time are important performance metrics to evaluate network interference. Network interference on HPC applications is more reflected by the message latency variation, whereas ML application performance depends more on the communication time.
☆ SignSGD with Federated Voting
Distributed learning is commonly used for accelerating model training by harnessing the computational capabilities of multiple-edge devices. However, in practical applications, the communication delay emerges as a bottleneck due to the substantial information exchange required between workers and a central parameter server. SignSGD with majority voting (signSGD-MV) is an effective distributed learning algorithm that can significantly reduce communication costs by one-bit quantization. However, due to heterogeneous computational capabilities, it fails to converge when the mini-batch sizes differ among workers. To overcome this, we propose a novel signSGD optimizer with \textit{federated voting} (signSGD-FV). The idea of federated voting is to exploit learnable weights to perform weighted majority voting. The server learns the weights assigned to the edge devices in an online fashion based on their computational capabilities. Subsequently, these weights are employed to decode the signs of the aggregated local gradients in such a way to minimize the sign decoding error probability. We provide a unified convergence rate analysis framework applicable to scenarios where the estimated weights are known to the parameter server either perfectly or imperfectly. We demonstrate that the proposed signSGD-FV algorithm has a theoretical convergence guarantee even when edge devices use heterogeneous mini-batch sizes. Experimental results show that signSGD-FV outperforms signSGD-MV, exhibiting a faster convergence rate, especially in heterogeneous mini-batch sizes.
♻ ☆ ProvDeploy: Provenance-oriented Containerization of High Performance Computing Scientific Workflows
Many existing scientific workflows require High Performance Computing environments to produce results in a timely manner. These workflows have several software library components and use different environments, making the deployment and execution of the software stack not trivial. This complexity increases if the user needs to add provenance data capture services to the workflow. This manuscript introduces ProvDeploy to assist the user in configuring containers for scientific workflows with integrated provenance data capture. ProvDeploy was evaluated with a Scientific Machine Learning workflow, exploring containerization strategies focused on provenance in two distinct HPC environments
♻ ☆ The Implications of Decentralization in Blockchained Federated Learning: Evaluating the Impact of Model Staleness and Inconsistencies
Blockchain promises to enhance distributed machine learning (ML) approaches such as federated learning (FL) by providing further decentralization, security, immutability, and trust, which are key properties for enabling collaborative intelligence in next-generation applications. Nonetheless, the intrinsic decentralized operation of peer-to-peer (P2P) blockchain nodes leads to an uncharted setting for FL, whereby the concepts of FL round and global model become meaningless, as devices' synchronization is lost without the figure of a central orchestrating server. In this paper, we study the practical implications of outsourcing the orchestration of FL to a democratic setting such as in a blockchain. In particular, we focus on the effects that model staleness and inconsistencies, endorsed by blockchains' modus operandi, have on the training procedure held by FL devices asynchronously. Using simulation, we evaluate the blockchained FL operation by applying two different ML models (ranging from low to high complexity) on the well-known MNIST and CIFAR-10 datasets, respectively, and focus on the accuracy and timeliness of the solutions. Our results show the high impact of model inconsistencies on the accuracy of the models (up to a ~35% decrease in prediction accuracy), which underscores the importance of properly designing blockchain systems based on the characteristics of the underlying FL application.
♻ ☆ Multi-Agent Online Graph Exploration on Cycles and Tadpole Graphs
We study the problem of multi-agent online graph exploration, in which a team of k agents has to explore a given graph, starting and ending on the same node. The graph is initially unknown. Whenever a node is visited by an agent, its neighborhood and adjacent edges are revealed. The agents share a global view of the explored parts of the graph. The cost of the exploration has to be minimized, where cost either describes the time needed for the entire exploration (time model), or the length of the longest path traversed by any agent (energy model). We investigate graph exploration on cycles and tadpole graphs for 2-4 agents, providing optimal results on the competitive ratio in the energy model (1-competitive with two agents on cycles and three agents on tadpole graphs), and for tadpole graphs in the time model (1.5-competitive with four agents). We also show competitive upper bounds of 2 for the exploration of tadpole graphs with three agents, and 2.5 for the exploration of tadpole graphs with two agents in the time model.
comment: v2: Update Related Work, more detailed description of models in Motivation
♻ ☆ Plotinus: A Satellite Internet Digital Twin System
The development of an integrated space-air-ground network (SAGIN) requires sophisticated satellite Internet emulation tools that can handle complex, dynamic topologies and offer in-depth analysis. Existing emulation platforms struggle with challenges like the need for detailed implementation across all network layers, real-time response, and scalability. This paper proposes a digital twin system based on microservices for satellite Internet emulation, namely Plotinus, which aims to solve these problems. Plotinus features a modular design, allowing for easy replacement of the physical layer to emulate different aerial vehicles and analyze channel interference. It also enables replacing path computation methods to simplify testing and deploying algorithms. In particular, Plotinus allows for real-time emulation with live network traffic, enhancing practical network models. The evaluation result shows Plotinus's effective emulation of dynamic satellite networks with real-world devices. Its adaptability for various communication models and algorithm testing highlights Plotinus's role as a vital tool for developing and analyzing SAGIN systems, offering a cross-layer, real-time and scalable digital twin system.
♻ ☆ Accelerating Graph Neural Networks on Real Processing-In-Memory Systems
Graph Neural Networks (GNNs) are emerging ML models to analyze graph-structure data. Graph Neural Network (GNN) execution involves both compute-intensive and memory-intensive kernels, the latter dominates the total time, being significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple processors near or inside to memory arrays. In this work, we introduce PyGim, an efficient ML framework that accelerates GNNs on real PIM systems. We propose intelligent parallelization techniques for memory-intensive kernels of GNNs tailored for real PIM systems, and develop handy Python API for them. We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed in processor-centric and memory-centric computing systems, respectively, to match their algorithmic nature. We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by on average 3.04x, and achieves higher resource utilization than CPU and GPU systems. Our work provides useful recommendations for software, system and hardware designers. PyGim will be open-sourced to enable the widespread use of PIM systems in GNNs.
♻ ☆ PICO: Pipeline Inference Framework for Versatile CNNs on Diverse Mobile Devices
Distributing the inference of convolutional neural network (CNN) to multiple mobile devices has been studied in recent years to achieve real-time inference without losing accuracy. However, how to map CNN to devices remains a challenge. On the one hand, scheduling the workload of state-of-the-art CNNs with multiple devices is NP-Hard because the structures of CNNs are directed acyclic graphs (DAG) rather than simple chains. On the other hand, distributing the inference workload suffers from expensive communication and unbalanced computation due to the wireless environment and heterogeneous devices. This paper presents PICO, a pipeline cooperation framework to accelerate the inference of versatile CNNs on diverse mobile devices. At its core, PICO features: (1) a generic graph partition algorithm that considers the characteristics of any given CNN and orchestrates it into a list of model pieces with suitable granularity, and (2) a many-to-many mapping algorithm that produces the best pipeline configuration for heterogeneous devices. In our experiment with 2 ~ 8 Raspberry-Pi devices, the throughput can be improved by 1.8 ~ 6.8x under different CPU frequencies.
comment: Accepted by IEEE Transactions on Mobile Computing
♻ ☆ Toward parallel intelligence: an interdisciplinary solution for complex systems
The growing complexity of real-world systems necessitates interdisciplinary solutions to confront myriad challenges in modeling, analysis, management, and control. To meet these demands, the parallel systems method rooted in Artificial systems, Computational experiments, and Parallel execution (ACP) approach has been developed. The method cultivates a cycle, termed parallel intelligence, which iteratively creates data, acquires knowledge, and refines the actual system. Over the past two decades, the parallel systems method has continuously woven advanced knowledge and technologies from various disciplines, offering versatile interdisciplinary solutions for complex systems across diverse fields. This review explores the origins and fundamental concepts of the parallel systems method, showcasing its accomplishments as a diverse array of parallel technologies and applications, while also prognosticating potential challenges. We posit that this method will considerably augment sustainable development while enhancing interdisciplinary communication and cooperation.
comment: 41 pages, 6 figures. The Innovation (2023)
Machine Learning
☆ One-Shot Domain Incremental Learning IJCNN
Domain incremental learning (DIL) has been discussed in previous studies on deep neural network models for classification. In DIL, we assume that samples on new domains are observed over time. The models must classify inputs on all domains. In practice, however, we may encounter a situation where we need to perform DIL under the constraint that the samples on the new domain are observed only infrequently. Therefore, in this study, we consider the extreme case where we have only one sample from the new domain, which we call one-shot DIL. We first empirically show that existing DIL methods do not work well in one-shot DIL. We have analyzed the reason for this failure through various investigations. According to our analysis, we clarify that the difficulty of one-shot DIL is caused by the statistics in the batch normalization layers. Therefore, we propose a technique regarding these statistics and demonstrate the effectiveness of our technique through experiments on open datasets.
comment: accepted at IEEE International Joint Conference on Neural Networks (IJCNN) 2024
☆ Assessing the Performance of Deep Learning for Automated Gleason Grading in Prostate Cancer
Prostate cancer is a dominant health concern calling for advanced diagnostic tools. Utilizing digital pathology and artificial intelligence, this study explores the potential of 11 deep neural network architectures for automated Gleason grading in prostate carcinoma focusing on comparing traditional and recent architectures. A standardized image classification pipeline, based on the AUCMEDI framework, facilitated robust evaluation using an in-house dataset consisting of 34,264 annotated tissue tiles. The results indicated varying sensitivity across architectures, with ConvNeXt demonstrating the strongest performance. Notably, newer architectures achieved superior performance, even though with challenges in differentiating closely related Gleason grades. The ConvNeXt model was capable of learning a balance between complexity and generalizability. Overall, this study lays the groundwork for enhanced Gleason grading systems, potentially improving diagnostic efficiency for prostate cancer.
☆ Synapse: Learning Preferential Concepts from Visual Demonstrations
This paper addresses the problem of preference learning, which aims to learn user-specific preferences (e.g., "good parking spot", "convenient drop-off location") from visual input. Despite its similarity to learning factual concepts (e.g., "red cube"), preference learning is a fundamentally harder problem due to its subjective nature and the paucity of person-specific training data. We address this problem using a new framework called Synapse, which is a neuro-symbolic approach designed to efficiently learn preferential concepts from limited demonstrations. Synapse represents preferences as neuro-symbolic programs in a domain-specific language (DSL) that operates over images, and leverages a novel combination of visual parsing, large language models, and program synthesis to learn programs representing individual preferences. We evaluate Synapse through extensive experimentation including a user case study focusing on mobility-related concepts in mobile robotics and autonomous driving. Our evaluation demonstrates that Synapse significantly outperforms existing baselines as well as its own ablations. The code and other details can be found on the project website https://amrl.cs.utexas.edu/synapse .
comment: 23 pages, 7 figures; Preprint
☆ A note on generalization bounds for losses with finite moments
This paper studies the truncation method from Alquier [1] to derive high-probability PAC-Bayes bounds for unbounded losses with heavy tails. Assuming that the $p$-th moment is bounded, the resulting bounds interpolate between a slow rate $1 / \sqrt{n}$ when $p=2$, and a fast rate $1 / n$ when $p \to \infty$ and the loss is essentially bounded. Moreover, the paper derives a high-probability PAC-Bayes bound for losses with a bounded variance. This bound has an exponentially better dependence on the confidence parameter and the dependency measure than previous bounds in the literature. Finally, the paper extends all results to guarantees in expectation and single-draw PAC-Bayes. In order to so, it obtains analogues of the PAC-Bayes fast rate bound for bounded losses from [2] in these settings.
comment: 9 pages: 5 of main text, 1 of references, and 3 of appendices
☆ Symmetric Basis Convolutions for Learning Lagrangian Fluid Mechanics ICLR
Learning physical simulations has been an essential and central aspect of many recent research efforts in machine learning, particularly for Navier-Stokes-based fluid mechanics. Classic numerical solvers have traditionally been computationally expensive and challenging to use in inverse problems, whereas Neural solvers aim to address both concerns through machine learning. We propose a general formulation for continuous convolutions using separable basis functions as a superset of existing methods and evaluate a large set of basis functions in the context of (a) a compressible 1D SPH simulation, (b) a weakly compressible 2D SPH simulation, and (c) an incompressible 2D SPH Simulation. We demonstrate that even and odd symmetries included in the basis functions are key aspects of stability and accuracy. Our broad evaluation shows that Fourier-based continuous convolutions outperform all other architectures regarding accuracy and generalization. Finally, using these Fourier-based networks, we show that prior inductive biases, such as window functions, are no longer necessary. An implementation of our approach, as well as complete datasets and solver implementations, is available at https://github.com/tum-pbs/SFBC.
comment: Published at International Conference on Learning Representation (ICLR) 2024, 54 pages, 39 figures
☆ DeepGleason: a System for Automated Gleason Grading of Prostate Cancer using Deep Neural Networks
Advances in digital pathology and artificial intelligence (AI) offer promising opportunities for clinical decision support and enhancing diagnostic workflows. Previous studies already demonstrated AI's potential for automated Gleason grading, but lack state-of-the-art methodology and model reusability. To address this issue, we propose DeepGleason: an open-source deep neural network based image classification system for automated Gleason grading using whole-slide histopathology images from prostate tissue sections. Implemented with the standardized AUCMEDI framework, our tool employs a tile-wise classification approach utilizing fine-tuned image preprocessing techniques in combination with a ConvNeXt architecture which was compared to various state-of-the-art architectures. The neural network model was trained and validated on an in-house dataset of 34,264 annotated tiles from 369 prostate carcinoma slides. We demonstrated that DeepGleason is capable of highly accurate and reliable Gleason grading with a macro-averaged F1-score of 0.806, AUC of 0.991, and Accuracy of 0.974. The internal architecture comparison revealed that the ConvNeXt model was superior performance-wise on our dataset to established and other modern architectures like transformers. Furthermore, we were able to outperform the current state-of-the-art in tile-wise fine-classification with a sensitivity and specificity of 0.94 and 0.98 for benign vs malignant detection as well as of 0.91 and 0.75 for Gleason 3 vs Gleason 4 & 5 classification, respectively. Our tool contributes to the wider adoption of AI-based Gleason grading within the research community and paves the way for broader clinical application of deep learning models in digital pathology. DeepGleason is open-source and publicly available for research application in the following Git repository: https://github.com/frankkramer-lab/DeepGleason.
☆ FOOL: Addressing the Downlink Bottleneck in Satellite Computing with Neural Feature Compression
Nanosatellite constellations equipped with sensors capturing large geographic regions provide unprecedented opportunities for Earth observation. As constellation sizes increase, network contention poses a downlink bottleneck. Orbital Edge Computing (OEC) leverages limited onboard compute resources to reduce transfer costs by processing the raw captures at the source. However, current solutions have limited practicability due to reliance on crude filtering methods or over-prioritizing particular downstream tasks. This work presents FOOL, an OEC-native and task-agnostic feature compression method that preserves prediction performance. FOOL partitions high-resolution satellite imagery to maximize throughput. Further, it embeds context and leverages inter-tile dependencies to lower transfer costs with negligible overhead. While FOOL is a feature compressor, it can recover images with competitive scores on perceptual quality measures at lower bitrates. We extensively evaluate transfer cost reduction by including the peculiarity of intermittently available network connections in low earth orbit. Lastly, we test the feasibility of our system for standardized nanosatellite form factors. We demonstrate that FOOL permits downlinking over 100x the data volume without relying on prior information on the downstream tasks.
comment: 18 pages, double column, 19 figures, 7 tables, Initial Submission to IEEE Transactions on Mobile Computing
☆ Understanding the Functional Roles of Modelling Components in Spiking Neural Networks
Spiking neural networks (SNNs), inspired by the neural circuits of the brain, are promising in achieving high computational efficiency with biological fidelity. Nevertheless, it is quite difficult to optimize SNNs because the functional roles of their modelling components remain unclear. By designing and evaluating several variants of the classic model, we systematically investigate the functional roles of key modelling components, leakage, reset, and recurrence, in leaky integrate-and-fire (LIF) based SNNs. Through extensive experiments, we demonstrate how these components influence the accuracy, generalization, and robustness of SNNs. Specifically, we find that the leakage plays a crucial role in balancing memory retention and robustness, the reset mechanism is essential for uninterrupted temporal processing and computational efficiency, and the recurrence enriches the capability to model complex dynamics at a cost of robustness degradation. With these interesting observations, we provide optimization suggestions for enhancing the performance of SNNs in different scenarios. This work deepens the understanding of how SNNs work, which offers valuable guidance for the development of more effective and robust neuromorphic models.
☆ Graph Augmentation for Recommendation ICDE 2024
Graph augmentation with contrastive learning has gained significant attention in the field of recommendation systems due to its ability to learn expressive user representations, even when labeled data is limited. However, directly applying existing GCL models to real-world recommendation environments poses challenges. There are two primary issues to address. Firstly, the lack of consideration for data noise in contrastive learning can result in noisy self-supervised signals, leading to degraded performance. Secondly, many existing GCL approaches rely on graph neural network (GNN) architectures, which can suffer from over-smoothing problems due to non-adaptive message passing. To address these challenges, we propose a principled framework called GraphAug. This framework introduces a robust data augmentor that generates denoised self-supervised signals, enhancing recommender systems. The GraphAug framework incorporates a graph information bottleneck (GIB)-regularized augmentation paradigm, which automatically distills informative self-supervision information and adaptively adjusts contrastive view generation. Through rigorous experimentation on real-world datasets, we thoroughly assessed the performance of our novel GraphAug model. The outcomes consistently unveil its superiority over existing baseline methods. The source code for our model is publicly available at: https://github.com/HKUDS/GraphAug.
comment: 13 pages and accepted by ICDE 2024
☆ A Novel Loss Function-based Support Vector Machine for Binary Classification
The previous support vector machine(SVM) including $0/1$ loss SVM, hinge loss SVM, ramp loss SVM, truncated pinball loss SVM, and others, overlooked the degree of penalty for the correctly classified samples within the margin. This oversight affects the generalization ability of the SVM classifier to some extent. To address this limitation, from the perspective of confidence margin, we propose a novel Slide loss function ($\ell_s$) to construct the support vector machine classifier($\ell_s$-SVM). By introducing the concept of proximal stationary point, and utilizing the property of Lipschitz continuity, we derive the first-order optimality conditions for $\ell_s$-SVM. Based on this, we define the $\ell_s$ support vectors and working set of $\ell_s$-SVM. To efficiently handle $\ell_s$-SVM, we devise a fast alternating direction method of multipliers with the working set ($\ell_s$-ADMM), and provide the convergence analysis. The numerical experiments on real world datasets confirm the robustness and effectiveness of the proposed method.
☆ Bridging the Sim-to-Real Gap with Bayesian Inference
We present SIM-FSVGD for learning robot dynamics from data. As opposed to traditional methods, SIM-FSVGD leverages low-fidelity physical priors, e.g., in the form of simulators, to regularize the training of neural network models. While learning accurate dynamics already in the low data regime, SIM-FSVGD scales and excels also when more data is available. We empirically show that learning with implicit physical priors results in accurate mean model estimation as well as precise uncertainty quantification. We demonstrate the effectiveness of SIM-FSVGD in bridging the sim-to-real gap on a high-performance RC racecar system. Using model-based RL, we demonstrate a highly dynamic parking maneuver with drifting, using less than half the data compared to the state of the art.
☆ Multi-Scale Texture Loss for CT denoising with GANs
Generative Adversarial Networks (GANs) have proved as a powerful framework for denoising applications in medical imaging. However, GAN-based denoising algorithms still suffer from limitations in capturing complex relationships within the images. In this regard, the loss function plays a crucial role in guiding the image generation process, encompassing how much a synthetic image differs from a real image. To grasp highly complex and non-linear textural relationships in the training process, this work presents a loss function that leverages the intrinsic multi-scale nature of the Gray-Level-Co-occurrence Matrix (GLCM). Although the recent advances in deep learning have demonstrated superior performance in classification and detection tasks, we hypothesize that its information content can be valuable when integrated into GANs' training. To this end, we propose a differentiable implementation of the GLCM suited for gradient-based optimization. Our approach also introduces a self-attention layer that dynamically aggregates the multi-scale texture information extracted from the images. We validate our approach by carrying out extensive experiments in the context of low-dose CT denoising, a challenging application that aims to enhance the quality of noisy CT scans. We utilize three publicly available datasets, including one simulated and two real datasets. The results are promising as compared to other well-established loss functions, being also consistent across three different GAN architectures. The code is available at: https://github.com/FrancescoDiFeola/DenoTextureLoss
☆ A comparative analysis of embedding models for patent similarity
This paper makes two contributions to the field of text-based patent similarity. First, it compares the performance of different kinds of patent-specific pretrained embedding models, namely static word embeddings (such as word2vec and doc2vec models) and contextual word embeddings (such as transformers based models), on the task of patent similarity calculation. Second, it compares specifically the performance of Sentence Transformers (SBERT) architectures with different training phases on the patent similarity task. To assess the models' performance, we use information about patent interferences, a phenomenon in which two or more patent claims belonging to different patent applications are proven to be overlapping by patent examiners. Therefore, we use these interferences cases as a proxy for maximum similarity between two patents, treating them as ground-truth to evaluate the performance of the different embedding models. Our results point out that, first, Patent SBERT-adapt-ub, the domain adaptation of the pretrained Sentence Transformer architecture proposed in this research, outperforms the current state-of-the-art in patent similarity. Second, they show that, in some cases, large static models performances are still comparable to contextual ones when trained on extensive data; thus, we believe that the superiority in the performance of contextual embeddings may not be related to the actual architecture but rather to the way the training phase is performed.
☆ Calibrating Bayesian UNet++ for Sub-Seasonal Forecasting ICLR 2024
Seasonal forecasting is a crucial task when it comes to detecting the extreme heat and colds that occur due to climate change. Confidence in the predictions should be reliable since a small increase in the temperatures in a year has a big impact on the world. Calibration of the neural networks provides a way to ensure our confidence in the predictions. However, calibrating regression models is an under-researched topic, especially in forecasters. We calibrate a UNet++ based architecture, which was shown to outperform physics-based models in temperature anomalies. We show that with a slight trade-off between prediction error and calibration error, it is possible to get more reliable and sharper forecasts. We believe that calibration should be an important part of safety-critical machine learning applications such as weather forecasters.
comment: Accepted as a workshop paper at "ICLR 2024 Tackling Climate Change with Machine Learning"
☆ Distributed collaborative anomalous sound detection by embedding sharing
To develop a machine sound monitoring system, a method for detecting anomalous sound is proposed. In this paper, we explore a method for multiple clients to collaboratively learn an anomalous sound detection model while keeping their raw data private from each other. In the context of industrial machine anomalous sound detection, each client possesses data from different machines or different operational states, making it challenging to learn through federated learning or split learning. In our proposed method, each client calculates embeddings using a common pre-trained model developed for sound data classification, and these calculated embeddings are aggregated on the server to perform anomalous sound detection through outlier exposure. Experiments showed that our proposed method improves the AUC of anomalous sound detection by an average of 6.8%.
☆ Enhancing Industrial Transfer Learning with Style Filter: Cost Reduction and Defect-Focus
Addressing the challenge of data scarcity in industrial domains, transfer learning emerges as a pivotal paradigm. This work introduces Style Filter, a tailored methodology for industrial contexts. By selectively filtering source domain data before knowledge transfer, Style Filter reduces the quantity of data while maintaining or even enhancing the performance of transfer learning strategy. Offering label-free operation, minimal reliance on prior knowledge, independence from specific models, and re-utilization, Style Filter is evaluated on authentic industrial datasets, highlighting its effectiveness when employed before conventional transfer strategies in the deep learning domain. The results underscore the effectiveness of Style Filter in real-world industrial applications.
comment: 17 pages, 11 figures,4 tables
☆ EDUE: Expert Disagreement-Guided One-Pass Uncertainty Estimation for Medical Image Segmentation
Deploying deep learning (DL) models in medical applications relies on predictive performance and other critical factors, such as conveying trustworthy predictive uncertainty. Uncertainty estimation (UE) methods provide potential solutions for evaluating prediction reliability and improving the model confidence calibration. Despite increasing interest in UE, challenges persist, such as the need for explicit methods to capture aleatoric uncertainty and align uncertainty estimates with real-life disagreements among domain experts. This paper proposes an Expert Disagreement-Guided Uncertainty Estimation (EDUE) for medical image segmentation. By leveraging variability in ground-truth annotations from multiple raters, we guide the model during training and incorporate random sampling-based strategies to enhance calibration confidence. Our method achieves 55% and 23% improvement in correlation on average with expert disagreements at the image and pixel levels, respectively, better calibration, and competitive segmentation performance compared to the state-of-the-art deep ensembles, requiring only a single forward pass.
☆ Deciphering the Interplay between Local Differential Privacy, Average Bayesian Privacy, and Maximum Bayesian Privacy
The swift evolution of machine learning has led to emergence of various definitions of privacy due to the threats it poses to privacy, including the concept of local differential privacy (LDP). Although widely embraced and utilized across numerous domains, this conventional approach to measure privacy still exhibits certain limitations, spanning from failure to prevent inferential disclosure to lack of consideration for the adversary's background knowledge. In this comprehensive study, we introduce Bayesian privacy and delve into the intricate relationship between local differential privacy and its Bayesian counterparts, unveiling novel insights into utility-privacy trade-offs. We introduce a framework that encapsulates both attack and defense strategies, highlighting their interplay and effectiveness. Our theoretical contributions are anchored in the rigorous definitions and relationships between Average Bayesian Privacy (ABP) and Maximum Bayesian Privacy (MBP), encapsulated by equations $\epsilon_{p,a} \leq \frac{1}{\sqrt{2}}\sqrt{(\epsilon_{p,m} + \epsilon)\cdot(e^{\epsilon_{p,m} + \epsilon} - 1)}$ and the equivalence between $\xi$-MBP and $2\xi$-LDP established under uniform prior distribution. These relationships fortify our understanding of the privacy guarantees provided by various mechanisms, leading to the realization that a mechanism satisfying $\xi$-LDP also confers $\xi$-MBP, and vice versa. Our work not only lays the groundwork for future empirical exploration but also promises to enhance the design of privacy-preserving algorithms that do not compromise on utility, thereby fostering the development of trustworthy machine learning solutions.
☆ In the Search for Optimal Multi-view Learning Models for Crop Classification with Global Remote Sensing Data
Crop classification is of critical importance due to its role in studying crop pattern changes, resource management, and carbon sequestration. When employing data-driven techniques for its prediction, utilizing various temporal data sources is necessary. Deep learning models have proven to be effective for this task by mapping time series data to high-level representation for prediction. However, they face substantial challenges when dealing with multiple input patterns. The literature offers limited guidance for Multi-View Learning (MVL) scenarios, as it has primarily focused on exploring fusion strategies with specific encoders and validating them in local regions. In contrast, we investigate the impact of simultaneous selection of the fusion strategy and the encoder architecture evaluated on a global-scale cropland and crop-type classifications. We use a range of five fusion strategies (Input, Feature, Decision, Ensemble, Hybrid) and five temporal encoder architectures (LSTM, GRU, TempCNN, TAE, L-TAE) as possible MVL model configurations. The validation is on the CropHarvest dataset that provides optical, radar, and weather time series, and topographic information as input data. We found that in scenarios with a limited number of labeled samples, a unique configuration is insufficient for all the cases. Instead, a specialized combination, including encoder and fusion strategy, should be meticulously sought. To streamline this search process, we suggest initially identifying the optimal encoder architecture tailored for a particular fusion strategy, and then determining the most suitable fusion strategy for the classification task. We provide a technical framework for researchers exploring crop classification or related tasks through a MVL approach.
comment: submitted to journal
☆ Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization
Antibody design, a crucial task with significant implications across various disciplines such as therapeutics and biology, presents considerable challenges due to its intricate nature. In this paper, we tackle antigen-specific antibody design as a protein sequence-structure co-design problem, considering both rationality and functionality. Leveraging a pre-trained conditional diffusion model that jointly models sequences and structures of complementarity-determining regions (CDR) in antibodies with equivariant neural networks, we propose direct energy-based preference optimization to guide the generation of antibodies with both rational structures and considerable binding affinities to given antigens. Our method involves fine-tuning the pre-trained diffusion model using a residue-level decomposed energy preference. Additionally, we employ gradient surgery to address conflicts between various types of energy, such as attraction and repulsion. Experiments on RAbD benchmark show that our approach effectively optimizes the energy of generated antibodies and achieves state-of-the-art performance in designing high-quality antibodies with low total energy and high binding affinity, demonstrating the superiority of our approach.
☆ NSINA: A News Corpus for Sinhala LREC
The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSINA, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSINA aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSINA is the largest news corpus for Sinhala, available up to date.
comment: Accepted to LREC-COLING 2024 (The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)
☆ Revealing Vulnerabilities of Neural Networks in Parameter Learning and Defense Against Explanation-Aware Backdoors
Explainable Artificial Intelligence (XAI) strategies play a crucial part in increasing the understanding and trustworthiness of neural networks. Nonetheless, these techniques could potentially generate misleading explanations. Blinding attacks can drastically alter a machine learning algorithm's prediction and explanation, providing misleading information by adding visually unnoticeable artifacts into the input, while maintaining the model's accuracy. It poses a serious challenge in ensuring the reliability of XAI methods. To ensure the reliability of XAI methods poses a real challenge, we leverage statistical analysis to highlight the changes in CNN weights within a CNN following blinding attacks. We introduce a method specifically designed to limit the effectiveness of such attacks during the evaluation phase, avoiding the need for extra training. The method we suggest defences against most modern explanation-aware adversarial attacks, achieving an approximate decrease of ~99\% in the Attack Success Rate (ASR) and a ~91\% reduction in the Mean Square Error (MSE) between the original explanation and the defended (post-attack) explanation across three unique types of attacks.
☆ FedFixer: Mitigating Heterogeneous Label Noise in Federated Learning
Federated Learning (FL) heavily depends on label quality for its performance. However, the label distribution among individual clients is always both noisy and heterogeneous. The high loss incurred by client-specific samples in heterogeneous label noise poses challenges for distinguishing between client-specific and noisy label samples, impacting the effectiveness of existing label noise learning approaches. To tackle this issue, we propose FedFixer, where the personalized model is introduced to cooperate with the global model to effectively select clean client-specific samples. In the dual models, updating the personalized model solely at a local level can lead to overfitting on noisy data due to limited samples, consequently affecting both the local and global models' performance. To mitigate overfitting, we address this concern from two perspectives. Firstly, we employ a confidence regularizer to alleviate the impact of unconfident predictions caused by label noise. Secondly, a distance regularizer is implemented to constrain the disparity between the personalized and global models. We validate the effectiveness of FedFixer through extensive experiments on benchmark datasets. The results demonstrate that FedFixer can perform well in filtering noisy label samples on different clients, especially in highly heterogeneous label noise scenarios.
comment: accepted by AAA24
☆ Accelerating Federated Learning by Selecting Beneficial Herd of Local Gradients
Federated Learning (FL) is a distributed machine learning framework in communication network systems. However, the systems' Non-Independent and Identically Distributed (Non-IID) data negatively affect the convergence efficiency of the global model, since only a subset of these data samples are beneficial for model convergence. In pursuit of this subset, a reliable approach involves determining a measure of validity to rank the samples within the dataset. In this paper, We propose the BHerd strategy which selects a beneficial herd of local gradients to accelerate the convergence of the FL model. Specifically, we map the distribution of the local dataset to the local gradients and use the Herding strategy to obtain a permutation of the set of gradients, where the more advanced gradients in the permutation are closer to the average of the set of gradients. These top portion of the gradients will be selected and sent to the server for global aggregation. We conduct experiments on different datasets, models and scenarios by building a prototype system, and experimental results demonstrate that our BHerd strategy is effective in selecting beneficial local gradients to mitigate the effects brought by the Non-IID dataset, thus accelerating model convergence.
☆ Differentially Private Online Federated Learning with Correlated Noise
We propose a novel differentially private algorithm for online federated learning that employs temporally correlated noise to improve the utility while ensuring the privacy of the continuously released models. To address challenges stemming from DP noise and local updates with streaming noniid data, we develop a perturbed iterate analysis to control the impact of the DP noise on the utility. Moreover, we demonstrate how the drift errors from local updates can be effectively managed under a quasi-strong convexity condition. Subject to an $(\epsilon, \delta)$-DP budget, we establish a dynamic regret bound over the entire time horizon that quantifies the impact of key parameters and the intensity of changes in dynamic environments. Numerical experiments validate the efficacy of the proposed algorithm.
comment: 11 pages
☆ Causal Discovery from Poisson Branching Structural Causal Model Using High-Order Cumulant with Path Analysis AAAI-2024
Count data naturally arise in many fields, such as finance, neuroscience, and epidemiology, and discovering causal structure among count data is a crucial task in various scientific and industrial scenarios. One of the most common characteristics of count data is the inherent branching structure described by a binomial thinning operator and an independent Poisson distribution that captures both branching and noise. For instance, in a population count scenario, mortality and immigration contribute to the count, where survival follows a Bernoulli distribution, and immigration follows a Poisson distribution. However, causal discovery from such data is challenging due to the non-identifiability issue: a single causal pair is Markov equivalent, i.e., $X\rightarrow Y$ and $Y\rightarrow X$ are distributed equivalent. Fortunately, in this work, we found that the causal order from $X$ to its child $Y$ is identifiable if $X$ is a root vertex and has at least two directed paths to $Y$, or the ancestor of $X$ with the most directed path to $X$ has a directed path to $Y$ without passing $X$. Specifically, we propose a Poisson Branching Structure Causal Model (PB-SCM) and perform a path analysis on PB-SCM using high-order cumulants. Theoretical results establish the connection between the path and cumulant and demonstrate that the path information can be obtained from the cumulant. With the path information, causal order is identifiable under some graphical conditions. A practical algorithm for learning causal structure under PB-SCM is proposed and the experiments demonstrate and verify the effectiveness of the proposed method.
comment: Accepted by AAAI-2024
☆ Human Understanding AI Paper Challenge 2024 -- Dataset Design
In 2024, we will hold a research paper competition (the third Human Understanding AI Paper Challenge) for the research and development of artificial intelligence technologies to understand human daily life. This document introduces the datasets that will be provided to participants in the competition, and summarizes the issues to consider in data processing and learning model development.
comment: 7 pages, 3 figures
☆ PathoTune: Adapting Visual Foundation Model to Pathological Specialists MICCAI 2024
As natural image understanding moves towards the pretrain-finetune era, research in pathology imaging is concurrently evolving. Despite the predominant focus on pretraining pathological foundation models, how to adapt foundation models to downstream tasks is little explored. For downstream adaptation, we propose the existence of two domain gaps, i.e., the Foundation-Task Gap and the Task-Instance Gap. To mitigate these gaps, we introduce PathoTune, a framework designed to efficiently adapt pathological or even visual foundation models to pathology-specific tasks via multi-modal prompt tuning. The proposed framework leverages Task-specific Visual Prompts and Task-specific Textual Prompts to identify task-relevant features, along with Instance-specific Visual Prompts for encoding single pathological image features. Results across multiple datasets at both patch-level and WSI-level demonstrate its superior performance over single-modality prompt tuning approaches. Significantly, PathoTune facilitates the direct adaptation of natural visual foundation models to pathological tasks, drastically outperforming pathological foundation models with simple linear probing. The code will be available upon acceptance.
comment: Submitted to MICCAI 2024
☆ LSTTN: A Long-Short Term Transformer-based Spatio-temporal Neural Network for Traffic Flow Forecasting
Accurate traffic forecasting is a fundamental problem in intelligent transportation systems and learning long-range traffic representations with key information through spatiotemporal graph neural networks (STGNNs) is a basic assumption of current traffic flow prediction models. However, due to structural limitations, existing STGNNs can only utilize short-range traffic flow data; therefore, the models cannot adequately learn the complex trends and periodic features in traffic flow. Besides, it is challenging to extract the key temporal information from the long historical traffic series and obtain a compact representation. To solve the above problems, we propose a novel LSTTN (Long-Short Term Transformer-based Network) framework comprehensively considering the long- and short-term features in historical traffic flow. First, we employ a masked subseries Transformer to infer the content of masked subseries from a small portion of unmasked subseries and their temporal context in a pretraining manner, forcing the model to efficiently learn compressed and contextual subseries temporal representations from long historical series. Then, based on the learned representations, long-term trend is extracted by using stacked 1D dilated convolution layers, and periodic features are extracted by dynamic graph convolution layers. For the difficulties in making time-step level prediction, LSTTN adopts a short-term trend extractor to learn fine-grained short-term temporal features. Finally, LSTTN fuses the long-term trend, periodic features and short-term features to obtain the prediction results. Experiments on four real-world datasets show that in 60-minute-ahead long-term forecasting, the LSTTN model achieves a minimum improvement of 5.63\% and a maximum improvement of 16.78\% over baseline models. The source code is available at https://github.com/GeoX-Lab/LSTTN.
comment: 15 pages, 10 figures, 6 tables
☆ Determined Multi-Label Learning via Similarity-Based Prompt
In multi-label classification, each training instance is associated with multiple class labels simultaneously. Unfortunately, collecting the fully precise class labels for each training instance is time- and labor-consuming for real-world applications. To alleviate this problem, a novel labeling setting termed \textit{Determined Multi-Label Learning} (DMLL) is proposed, aiming to effectively alleviate the labeling cost inherent in multi-label tasks. In this novel labeling setting, each training instance is associated with a \textit{determined label} (either "Yes" or "No"), which indicates whether the training instance contains the provided class label. The provided class label is randomly and uniformly selected from the whole candidate labels set. Besides, each training instance only need to be determined once, which significantly reduce the annotation cost of the labeling task for multi-label datasets. In this paper, we theoretically derive an risk-consistent estimator to learn a multi-label classifier from these determined-labeled training data. Additionally, we introduce a similarity-based prompt learning method for the first time, which minimizes the risk-consistent loss of large-scale pre-trained models to learn a supplemental prompt with richer semantic information. Extensive experimental validation underscores the efficacy of our approach, demonstrating superior performance compared to existing state-of-the-art methods.
comment: 10 pages, 4 figures
☆ Learning from Reduced Labels for Long-Tailed Data
Long-tailed data is prevalent in real-world classification tasks and heavily relies on supervised information, which makes the annotation process exceptionally labor-intensive and time-consuming. Unfortunately, despite being a common approach to mitigate labeling costs, existing weakly supervised learning methods struggle to adequately preserve supervised information for tail samples, resulting in a decline in accuracy for the tail classes. To alleviate this problem, we introduce a novel weakly supervised labeling setting called Reduced Label. The proposed labeling setting not only avoids the decline of supervised information for the tail samples, but also decreases the labeling costs associated with long-tailed data. Additionally, we propose an straightforward and highly efficient unbiased framework with strong theoretical guarantees to learn from these Reduced Labels. Extensive experiments conducted on benchmark datasets including ImageNet validate the effectiveness of our approach, surpassing the performance of state-of-the-art weakly supervised methods.
comment: 12 pages, 3 figures
☆ Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator ICASSP 2024
A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because of its fast, lightweight, and high-quality characteristics. However, this data-driven model requires a large amount of training data incurring high data-collection costs. This fact motivates us to train a GAN-based vocoder on limited data. A promising solution is to augment the training data to avoid overfitting. However, a standard discriminator is unconditional and insensitive to distributional changes caused by data augmentation. Thus, augmented speech (which can be extraordinary) may be considered real speech. To address this issue, we propose an augmentation-conditional discriminator (AugCondD) that receives the augmentation state as input in addition to speech, thereby assessing the input speech according to the augmentation state, without inhibiting the learning of the original non-augmented distribution. Experimental results indicate that AugCondD improves speech quality under limited data conditions while achieving comparable speech quality under sufficient data conditions. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/.
comment: Accepted to ICASSP 2024. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/
☆ FedAC: A Adaptive Clustered Federated Learning Framework for Heterogeneous Data
Clustered federated learning (CFL) is proposed to mitigate the performance deterioration stemming from data heterogeneity in federated learning (FL) by grouping similar clients for cluster-wise model training. However, current CFL methods struggle due to inadequate integration of global and intra-cluster knowledge and the absence of an efficient online model similarity metric, while treating the cluster count as a fixed hyperparameter limits flexibility and robustness. In this paper, we propose an adaptive CFL framework, named FedAC, which (1) efficiently integrates global knowledge into intra-cluster learning by decoupling neural networks and utilizing distinct aggregation methods for each submodule, significantly enhancing performance; (2) includes a costeffective online model similarity metric based on dimensionality reduction; (3) incorporates a cluster number fine-tuning module for improved adaptability and scalability in complex, heterogeneous environments. Extensive experiments show that FedAC achieves superior empirical performance, increasing the test accuracy by around 1.82% and 12.67% on CIFAR-10 and CIFAR-100 datasets, respectively, under different non-IID settings compared to SOTA methods.
comment: 14 pages, 4 figures
☆ On the rates of convergence for learning with convolutional neural networks
We study the approximation and learning capacities of convolutional neural networks (CNNs). Our first result proves a new approximation bound for CNNs with certain constraint on the weights. Our second result gives a new analysis on the covering number of feed-forward neural networks, which include CNNs as special cases. The analysis carefully takes into account the size of the weights and hence gives better bounds than existing literature in some situations. Using these two results, we are able to derive rates of convergence for estimators based on CNNs in many learning problems. In particular, we establish minimax optimal convergence rates of the least squares based on CNNs for learning smooth functions in the nonparametric regression setting. For binary classification, we derive convergence rates for CNN classifiers with hinge loss and logistic loss. It is also shown that the obtained rates are minimax optimal in several settings.
☆ DeepMachining: Online Prediction of Machining Errors of Lathe Machines
We describe DeepMachining, a deep learning-based AI system for online prediction of machining errors of lathe machine operations. We have built and evaluated DeepMachining based on manufacturing data from factories. Specifically, we first pretrain a deep learning model for a given lathe machine's operations to learn the salient features of machining states. Then, we fine-tune the pretrained model to adapt to specific machining tasks. We demonstrate that DeepMachining achieves high prediction accuracy for multiple tasks that involve different workpieces and cutting tools. To the best of our knowledge, this work is one of the first factory experiments using pre-trained deep-learning models to predict machining errors of lathe machines.
☆ If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize important textual features for VLMs. EX2 uses reinforcement learning to align a large language model with VLM preferences and generates descriptions that incorporate the important features for the VLM. Then, we inspect the descriptions to identify the features that contribute to VLM representations. We find that spurious descriptions have a major role in VLM representations despite providing no helpful information, e.g., Click to enlarge photo of CONCEPT. More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
comment: Code: https://github.com/BatsResearch/ex2
☆ Producing and Leveraging Online Map Uncertainty in Trajectory Prediction CVPR 2024
High-definition (HD) maps have played an integral role in the development of modern autonomous vehicle (AV) stacks, albeit with high associated labeling and maintenance costs. As a result, many recent works have proposed methods for estimating HD maps online from sensor data, enabling AVs to operate outside of previously-mapped regions. However, current online map estimation approaches are developed in isolation of their downstream tasks, complicating their integration in AV stacks. In particular, they do not produce uncertainty or confidence estimates. In this work, we extend multiple state-of-the-art online map estimation methods to additionally estimate uncertainty and show how this enables more tightly integrating online mapping with trajectory forecasting. In doing so, we find that incorporating uncertainty yields up to 50% faster training convergence and up to 15% better prediction performance on the real-world nuScenes driving dataset.
comment: 14 pages, 14 figures, 6 tables. CVPR 2024
☆ An incremental MaxSAT-based model to learn balanced rules
The increasing advancements in the field of machine learning have led to the development of numerous applications that effectively address a wide range of problems with accurate predictions. However, in certain cases, accuracy alone may not be sufficient. Many real-world problems also demand explanations and interpretability behind the predictions. One of the most popular interpretable models that are classification rules. This work aims to propose an incremental model for learning interpretable and balanced rules based on MaxSAT, called IMLIB. This new model was based on two other approaches, one based on SAT and the other on MaxSAT. The one based on SAT limits the size of each generated rule, making it possible to balance them. We suggest that such a set of rules seem more natural to be understood compared to a mixture of large and small rules. The approach based on MaxSAT, called IMLI, presents a technique to increase performance that involves learning a set of rules by incrementally applying the model in a dataset. Finally, IMLIB and IMLI are compared using diverse databases. IMLIB obtained results comparable to IMLI in terms of accuracy, generating more balanced rules with smaller sizes.
comment: 16 pages, 5 tables, submitted to BRACIS 2023 (Brazilian Conference on Intelligent Systems), accepted version published in Intelligent Systems, LNCS, vol 14195
☆ Ensemble Adversarial Defense via Integration of Multiple Dispersed Low Curvature Models IJCNN
The integration of an ensemble of deep learning models has been extensively explored to enhance defense against adversarial attacks. The diversity among sub-models increases the attack cost required to deceive the majority of the ensemble, thereby improving the adversarial robustness. While existing approaches mainly center on increasing diversity in feature representations or dispersion of first-order gradients with respect to input, the limited correlation between these diversity metrics and adversarial robustness constrains the performance of ensemble adversarial defense. In this work, we aim to enhance ensemble diversity by reducing attack transferability. We identify second-order gradients, which depict the loss curvature, as a key factor in adversarial robustness. Computing the Hessian matrix involved in second-order gradients is computationally expensive. To address this, we approximate the Hessian-vector product using differential approximation. Given that low curvature provides better robustness, our ensemble model was designed to consider the influence of curvature among different sub-models. We introduce a novel regularizer to train multiple more-diverse low-curvature network models. Extensive experiments across various datasets demonstrate that our ensemble model exhibits superior robustness against a range of attacks, underscoring the effectiveness of our approach.
comment: Accepted to The 2024 International Joint Conference on Neural Networks (IJCNN)
☆ Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data CVPR 2024
Federated learning achieves effective performance in modeling decentralized data. In practice, client data are not well-labeled, which makes it potential for federated unsupervised learning (FUSL) with non-IID data. However, the performance of existing FUSL methods suffers from insufficient representations, i.e., (1) representation collapse entanglement among local and global models, and (2) inconsistent representation spaces among local models. The former indicates that representation collapse in local model will subsequently impact the global model and other local models. The latter means that clients model data representation with inconsistent parameters due to the deficiency of supervision signals. In this work, we propose FedU2 which enhances generating uniform and unified representation in FUSL with non-IID data. Specifically, FedU2 consists of flexible uniform regularizer (FUR) and efficient unified aggregator (EUA). FUR in each client avoids representation collapse via dispersing samples uniformly, and EUA in server promotes unified representation by constraining consistent client model updating. To extensively validate the performance of FedU2, we conduct both cross-device and cross-silo evaluation experiments on two benchmark datasets, i.e., CIFAR10 and CIFAR100.
comment: CVPR 2024
☆ Concurrent Linguistic Error Detection (CLED) for Large Language Models
The wide adoption of Large language models (LLMs) makes their dependability a pressing concern. Detection of errors is the first step to mitigating their impact on a system and thus, efficient error detection for LLMs is an important issue. In many settings, the LLM is considered as a black box with no access to the internal nodes; this prevents the use of many error detection schemes that need access to the model's internal nodes. An interesting observation is that the output of LLMs in error-free operation should be valid and normal text. Therefore, when the text is not valid or differs significantly from normal text, it is likely that there is an error. Based on this observation we propose to perform Concurrent Linguistic Error Detection (CLED); this scheme extracts some linguistic features of the text generated by the LLM and feeds them to a concurrent classifier that detects errors. Since the proposed error detection mechanism only relies on the outputs of the model, then it can be used on LLMs in which there is no access to the internal nodes. The proposed CLED scheme has been evaluated on the T5 model when used for news summarization and on the OPUS-MT model when used for translation. In both cases, the same set of linguistic features has been used for error detection to illustrate the applicability of the proposed scheme beyond a specific case. The results show that CLED can detect most of the errors at a low overhead penalty. The use of the concurrent classifier also enables a trade-off between error detection effectiveness and its associated overhead, so providing flexibility to a designer.
comment: 11 pages, 6 figures, 30 references
☆ Physics-informed RL for Maximal Safety Probability Estimation
Accurate risk quantification and reachability analysis are crucial for safe control and learning, but sampling from rare events, risky states, or long-term trajectories can be prohibitively costly. Motivated by this, we study how to estimate the long-term safety probability of maximally safe actions without sufficient coverage of samples from risky states and long-term trajectories. The use of maximal safety probability in control and learning is expected to avoid conservative behaviors due to over-approximation of risk. Here, we first show that long-term safety probability, which is multiplicative in time, can be converted into additive costs and be solved using standard reinforcement learning methods. We then derive this probability as solutions of partial differential equations (PDEs) and propose Physics-Informed Reinforcement Learning (PIRL) algorithm. The proposed method can learn using sparse rewards because the physics constraints help propagate risk information through neighbors. This suggests that, for the purpose of extracting more information for efficient learning, physics constraints can serve as an alternative to reward shaping. The proposed method can also estimate long-term risk using short-term samples and deduce the risk of unsampled states. This feature is in stark contrast with the unconstrained deep RL that demands sufficient data coverage. These merits of the proposed method are demonstrated in numerical simulation.
☆ Real-time Adaptation for Condition Monitoring Signal Prediction using Label-aware Neural Processes
Building a predictive model that rapidly adapts to real-time condition monitoring (CM) signals is critical for engineering systems/units. Unfortunately, many current methods suffer from a trade-off between representation power and agility in online settings. For instance, parametric methods that assume an underlying functional form for CM signals facilitate efficient online prediction updates. However, this simplification leads to vulnerability to model specifications and an inability to capture complex signals. On the other hand, approaches based on over-parameterized or non-parametric models can excel at explaining complex nonlinear signals, but real-time updates for such models pose a challenging task. In this paper, we propose a neural process-based approach that addresses this trade-off. It encodes available observations within a CM signal into a representation space and then reconstructs the signal's history and evolution for prediction. Once trained, the model can encode an arbitrary number of observations without requiring retraining, enabling on-the-spot real-time predictions along with quantified uncertainty and can be readily updated as more online data is gathered. Furthermore, our model is designed to incorporate qualitative information (i.e., labels) from individual units. This integration not only enhances individualized predictions for each unit but also enables joint inference for both signals and their associated labels. Numerical studies on both synthetic and real-world data in reliability engineering highlight the advantageous features of our model in real-time adaptation, enhanced signal prediction with uncertainty quantification, and joint prediction for labels and signals.
☆ ProIn: Learning to Predict Trajectory Based on Progressive Interactions for Autonomous Driving
Accurate motion prediction of pedestrians, cyclists, and other surrounding vehicles (all called agents) is very important for autonomous driving. Most existing works capture map information through an one-stage interaction with map by vector-based attention, to provide map constraints for social interaction and multi-modal differentiation. However, these methods have to encode all required map rules into the focal agent's feature, so as to retain all possible intentions' paths while at the meantime to adapt to potential social interaction. In this work, a progressive interaction network is proposed to enable the agent's feature to progressively focus on relevant maps, in order to better learn agents' feature representation capturing the relevant map constraints. The network progressively encode the complex influence of map constraints into the agent's feature through graph convolutions at the following three stages: after historical trajectory encoder, after social interaction, and after multi-modal differentiation. In addition, a weight allocation mechanism is proposed for multi-modal training, so that each mode can obtain learning opportunities from a single-mode ground truth. Experiments have validated the superiority of progressive interactions to the existing one-stage interaction, and demonstrate the effectiveness of each component. Encouraging results were obtained in the challenging benchmarks.
☆ SignSGD with Federated Voting
Distributed learning is commonly used for accelerating model training by harnessing the computational capabilities of multiple-edge devices. However, in practical applications, the communication delay emerges as a bottleneck due to the substantial information exchange required between workers and a central parameter server. SignSGD with majority voting (signSGD-MV) is an effective distributed learning algorithm that can significantly reduce communication costs by one-bit quantization. However, due to heterogeneous computational capabilities, it fails to converge when the mini-batch sizes differ among workers. To overcome this, we propose a novel signSGD optimizer with \textit{federated voting} (signSGD-FV). The idea of federated voting is to exploit learnable weights to perform weighted majority voting. The server learns the weights assigned to the edge devices in an online fashion based on their computational capabilities. Subsequently, these weights are employed to decode the signs of the aggregated local gradients in such a way to minimize the sign decoding error probability. We provide a unified convergence rate analysis framework applicable to scenarios where the estimated weights are known to the parameter server either perfectly or imperfectly. We demonstrate that the proposed signSGD-FV algorithm has a theoretical convergence guarantee even when edge devices use heterogeneous mini-batch sizes. Experimental results show that signSGD-FV outperforms signSGD-MV, exhibiting a faster convergence rate, especially in heterogeneous mini-batch sizes.
☆ Learning Action-based Representations Using Invariance
Robust reinforcement learning agents using high-dimensional observations must be able to identify relevant state features amidst many exogeneous distractors. A representation that captures controllability identifies these state elements by determining what affects agent control. While methods such as inverse dynamics and mutual information capture controllability for a limited number of timesteps, capturing long-horizon elements remains a challenging problem. Myopic controllability can capture the moment right before an agent crashes into a wall, but not the control-relevance of the wall while the agent is still some distance away. To address this we introduce action-bisimulation encoding, a method inspired by the bisimulation invariance pseudometric, that extends single-step controllability with a recursive invariance constraint. By doing this, action-bisimulation learns a multi-step controllability metric that smoothly discounts distant state features that are relevant for control. We demonstrate that action-bisimulation pretraining on reward-free, uniformly random data improves sample efficiency in several environments, including a photorealistic 3D simulation domain, Habitat. Additionally, we provide theoretical analysis and qualitative results demonstrating the information captured by action-bisimulation.
☆ Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion
Modern neural networks are often trained on massive datasets that are web scraped with minimal human inspection. As a result of this insecure curation pipeline, an adversary can poison or backdoor the resulting model by uploading malicious data to the internet and waiting for a victim to scrape and train on it. Existing approaches for creating poisons and backdoors start with randomly sampled clean data, called base samples, and then modify those samples to craft poisons. However, some base samples may be significantly more amenable to poisoning than others. As a result, we may be able to craft more potent poisons by carefully choosing the base samples. In this work, we use guided diffusion to synthesize base samples from scratch that lead to significantly more potent poisons and backdoors than previous state-of-the-art attacks. Our Guided Diffusion Poisoning (GDP) base samples can be combined with any downstream poisoning or backdoor attack to boost its effectiveness. Our implementation code is publicly available at: https://github.com/hsouri/GDP .
☆ ChatDBG: An AI-Powered Debugging Assistant
This paper presents ChatDBG, the first AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like "why is x null?". To handle these queries, ChatDBG grants the LLM autonomy to take the wheel and drive debugging by issuing commands to navigate through stacks and inspect program state; it then reports its findings and yields back control to the programmer. Our ChatDBG prototype integrates with standard debuggers including LLDB, GDB, and WinDBG for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded nearly 30,000 times.
comment: 11 pages
☆ ChatGPT Incorrectness Detection in Software Reviews
We conducted a survey of 135 software engineering (SE) practitioners to understand how they use Generative AI-based chatbots like ChatGPT for SE tasks. We find that they want to use ChatGPT for SE tasks like software library selection but often worry about the truthfulness of ChatGPT responses. We developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) to automatically test and detect the incorrectness in ChatGPT responses. CID is based on the iterative prompting to ChatGPT by asking it contextually similar but textually divergent questions (using an approach that utilizes metamorphic relationships in texts). The underlying principle in CID is that for a given question, a response that is different from other responses (across multiple incarnations of the question) is likely an incorrect response. In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74 - 0.75.
☆ Predictive Inference in Multi-environment Scenarios
We address the challenge of constructing valid confidence intervals and sets in problems of prediction across multiple environments. We investigate two types of coverage suitable for these problems, extending the jackknife and split-conformal methods to show how to obtain distribution-free coverage in such non-traditional, hierarchical data-generating scenarios. Our contributions also include extensions for settings with non-real-valued responses and a theory of consistency for predictive inference in these general problems. We demonstrate a novel resizing method to adapt to problem difficulty, which applies both to existing approaches for predictive inference with hierarchical data and the methods we develop; this reduces prediction set sizes using limited information from the test environment, a key to the methods' practical performance, which we evaluate through neurochemical sensing and species classification datasets.
☆ MEDDAP: Medical Dataset Enhancement via Diversified Augmentation Pipeline MICCAI-2024
The effectiveness of Deep Neural Networks (DNNs) heavily relies on the abundance and accuracy of available training data. However, collecting and annotating data on a large scale is often both costly and time-intensive, particularly in medical cases where practitioners are already occupied with their duties. Moreover, ensuring that the model remains robust across various scenarios of image capture is crucial in medical domains, especially when dealing with ultrasound images that vary based on the settings of different devices and the manual operation of the transducer. To address this challenge, we introduce a novel pipeline called MEDDAP, which leverages Stable Diffusion (SD) models to augment existing small datasets by automatically generating new informative labeled samples. Pretrained checkpoints for SD are typically based on natural images, and training them for medical images requires significant GPU resources due to their heavy parameters. To overcome this challenge, we introduce USLoRA (Ultrasound Low-Rank Adaptation), a novel fine-tuning method tailored specifically for ultrasound applications. USLoRA allows for selective fine-tuning of weights within SD, requiring fewer than 0.1\% of parameters compared to fully fine-tuning only the UNet portion of SD. To enhance dataset diversity, we incorporate different adjectives into the generation process prompts, thereby desensitizing the classifiers to intensity changes across different images. This approach is inspired by clinicians' decision-making processes regarding breast tumors, where tumor shape often plays a more crucial role than intensity. In conclusion, our pipeline not only outperforms classifiers trained on the original dataset but also demonstrates superior performance when encountering unseen datasets. The source code is available at https://github.com/yasamin-med/MEDDAP.
comment: submitted to miccai 2024 submitted to miccai 2024 Submitted to MICCAI-2024
☆ Graphs Generalization under Distribution Shifts
Traditional machine learning methods heavily rely on the independent and identically distribution assumption, which imposes limitations when the test distribution deviates from the training distribution. To address this crucial issue, out-of-distribution (OOD) generalization, which aims to achieve satisfactory generalization performance when faced with unknown distribution shifts, has made a significant process. However, the OOD method for graph-structured data currently lacks clarity and remains relatively unexplored due to two primary challenges. Firstly, distribution shifts on graphs often occur simultaneously on node attributes and graph topology. Secondly, capturing invariant information amidst diverse distribution shifts proves to be a formidable challenge. To overcome these obstacles, in this paper, we introduce a novel framework, namely Graph Learning Invariant Domain genERation (GLIDER). The goal is to (1) diversify variations across domains by modeling the potential seen or unseen variations of attribute distribution and topological structure and (2) minimize the discrepancy of the variation in a representation space where the target is to predict semantic labels. Extensive experiment results indicate that our model outperforms baseline methods on node-level OOD generalization across domains in distribution shift on node features and topological structures simultaneously.
♻ ☆ BioNeRF: Biologically Plausible Neural Radiance Fields for View Synthesis
This paper presents BioNeRF, a biologically plausible architecture that models scenes in a 3D representation and synthesizes new views through radiance fields. Since NeRF relies on the network weights to store the scene's 3-dimensional representation, BioNeRF implements a cognitive-inspired mechanism that fuses inputs from multiple sources into a memory-like structure, improving the storing capacity and extracting more intrinsic and correlated information. BioNeRF also mimics a behavior observed in pyramidal cells concerning contextual information, in which the memory is provided as the context and combined with the inputs of two subsequent neural models, one responsible for producing the volumetric densities and the other the colors used to render the scene. Experimental results show that BioNeRF outperforms state-of-the-art results concerning a quality measure that encodes human perception in two datasets: real-world images and synthetic data.
♻ ☆ LOCOST: State-Space Models for Long Document Abstractive Summarization EACL 2024
State-space models are a low-complexity alternative to transformers for encoding long sequences and capturing long-term dependencies. We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs. With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns. We evaluate our model on a series of long document abstractive summarization tasks. The model reaches a performance level that is 93-96% comparable to the top-performing sparse transformers of the same size while saving up to 50% memory during training and up to 87% during inference. Additionally, LOCOST effectively handles input texts exceeding 600K tokens at inference time, setting new state-of-the-art results on full-book summarization and opening new perspectives for long input processing.
comment: 9 pages, 5 figures, 7 tables, EACL 2024 conference
♻ ☆ A Second Look on BASS -- Boosting Abstractive Summarization with Unified Semantic Graphs -- A Replication Study ECIR 2024
We present a detailed replication study of the BASS framework, an abstractive summarization system based on the notion of Unified Semantic Graphs. Our investigation includes challenges in replicating key components and an ablation study to systematically isolate error sources rooted in replicating novel components. Our findings reveal discrepancies in performance compared to the original work. We highlight the significance of paying careful attention even to reasonably omitted details for replicating advanced frameworks like BASS, and emphasize key practices for writing replicable papers.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Advances in Information Retrieval, 46th European Conference on Information Retrieval, ECIR 2024. 16 pages, 4 figures
♻ ☆ An Analysis of Linear Time Series Forecasting Models
Despite their simplicity, linear models perform well at time series forecasting, even when pitted against deeper and more expensive models. A number of variations to the linear model have been proposed, often including some form of feature normalisation that improves model generalisation. In this paper we analyse the sets of functions expressible using these linear model architectures. In so doing we show that several popular variants of linear models for time series forecasting are equivalent and functionally indistinguishable from standard, unconstrained linear regression. We characterise the model classes for each linear variant. We demonstrate that each model can be reinterpreted as unconstrained linear regression over a suitably augmented feature set, and therefore admit closed-form solutions when using a mean-squared loss function. We provide experimental evidence that the models under inspection learn nearly identical solutions, and finally demonstrate that the simpler closed form solutions are superior forecasters across 72% of test settings.
♻ ☆ Spatio-Temporal Few-Shot Learning via Diffusive Neural Network Generation
Spatio-temporal modeling is foundational for smart city applications, yet it is often hindered by data scarcity in many cities and regions. To bridge this gap, we propose a novel generative pre-training framework, GPD, for spatio-temporal few-shot learning with urban knowledge transfer. Unlike conventional approaches that heavily rely on common feature extraction or intricate few-shot learning designs, our solution takes a novel approach by performing generative pre-training on a collection of neural network parameters optimized with data from source cities. We recast spatio-temporal few-shot learning as pre-training a generative diffusion model, which generates tailored neural networks guided by prompts, allowing for adaptability to diverse data distributions and city-specific characteristics. GPD employs a Transformer-based denoising diffusion model, which is model-agnostic to integrate with powerful spatio-temporal neural networks. By addressing challenges arising from data gaps and the complexity of generalizing knowledge across cities, our framework consistently outperforms state-of-the-art baselines on multiple real-world datasets for tasks such as traffic speed prediction and crowd flow prediction. The implementation of our approach is available: https://github.com/tsinghua-fib-lab/GPD.
♻ ☆ Improving the forecast accuracy of wind power by leveraging multiple hierarchical structure
Renewable energy generation is of utmost importance for global decarbonization. Forecasting renewable energies, particularly wind energy, is challenging due to the inherent uncertainty in wind energy generation, which depends on weather conditions. Recent advances in hierarchical forecasting through reconciliation have demonstrated a significant increase in the quality of wind energy forecasts for short-term periods. We leverage the cross-sectional and temporal hierarchical structure of turbines in wind farms and build cross-temporal hierarchies to further investigate how integrated cross-sectional and temporal dimensions can add value to forecast accuracy in wind farms. We found that cross-temporal reconciliation was superior to individual cross-sectional reconciliation at multiple temporal aggregations. Additionally, machine learning based forecasts that were cross-temporally reconciled demonstrated high accuracy at coarser temporal granularities, which may encourage adoption for short-term wind forecasts. Empirically, we provide insights for decision-makers on the best methods for forecasting high-frequency wind data across different forecasting horizons and levels.
comment: 41 pages, 14 figures
♻ ☆ LightIt: Illumination Modeling and Control for Diffusion Models
We introduce LightIt, a method for explicit illumination control for image generation. Recent generative methods lack lighting control, which is crucial to numerous artistic aspects of image generation such as setting the overall mood or cinematic appearance. To overcome these limitations, we propose to condition the generation on shading and normal maps. We model the lighting with single bounce shading, which includes cast shadows. We first train a shading estimation module to generate a dataset of real-world images and shading pairs. Then, we train a control network using the estimated shading and normals as input. Our method demonstrates high-quality image generation and lighting control in numerous scenes. Additionally, we use our generated dataset to train an identity-preserving relighting model, conditioned on an image and a target shading. Our method is the first that enables the generation of images with controllable, consistent lighting and performs on par with specialized relighting state-of-the-art methods.
comment: Project page: https://peter-kocsis.github.io/LightIt/ Video: https://youtu.be/cCfSBD5aPLI
♻ ☆ BatteryML:An Open-source platform for Machine Learning on Battery Degradation
Battery degradation remains a pivotal concern in the energy storage domain, with machine learning emerging as a potent tool to drive forward insights and solutions. However, this intersection of electrochemical science and machine learning poses complex challenges. Machine learning experts often grapple with the intricacies of battery science, while battery researchers face hurdles in adapting intricate models tailored to specific datasets. Beyond this, a cohesive standard for battery degradation modeling, inclusive of data formats and evaluative benchmarks, is conspicuously absent. Recognizing these impediments, we present BatteryML - a one-step, all-encompass, and open-source platform designed to unify data preprocessing, feature extraction, and the implementation of both traditional and state-of-the-art models. This streamlined approach promises to enhance the practicality and efficiency of research applications. BatteryML seeks to fill this void, fostering an environment where experts from diverse specializations can collaboratively contribute, thus elevating the collective understanding and advancement of battery research.The code for our project is publicly available on GitHub at https://github.com/microsoft/BatteryML.
♻ ☆ Causal Question Answering with Reinforcement Learning WWW 2024
Causal questions inquire about causal relationships between different events or phenomena. They are important for a variety of use cases, including virtual assistants and search engines. However, many current approaches to causal question answering cannot provide explanations or evidence for their answers. Hence, in this paper, we aim to answer causal questions with a causality graph, a large-scale dataset of causal relations between noun phrases along with the relations' provenance data. Inspired by recent, successful applications of reinforcement learning to knowledge graph tasks, such as link prediction and fact-checking, we explore the application of reinforcement learning on a causality graph for causal question answering. We introduce an Actor-Critic-based agent which learns to search through the graph to answer causal questions. We bootstrap the agent with a supervised learning procedure to deal with large action spaces and sparse rewards. Our evaluation shows that the agent successfully prunes the search space to answer binary causal questions by visiting less than 30 nodes per question compared to over 3,000 nodes by a naive breadth-first search. Our ablation study indicates that our supervised learning strategy provides a strong foundation upon which our reinforcement learning agent improves. The paths returned by our agent explain the mechanisms by which a cause produces an effect. Moreover, for each edge on a path, our causality graph provides its original source allowing for easy verification of paths.
comment: Accepted at WWW 2024
♻ ☆ Exploring the Adversarial Capabilities of Large Language Models
The proliferation of large language models (LLMs) has sparked widespread and general interest due to their strong language generation capabilities, offering great potential for both industry and research. While previous research delved into the security and privacy issues of LLMs, the extent to which these models can exhibit adversarial behavior remains largely unexplored. Addressing this gap, we investigate whether common publicly available LLMs have inherent capabilities to perturb text samples to fool safety measures, so-called adversarial examples resp.~attacks. More specifically, we investigate whether LLMs are inherently able to craft adversarial examples out of benign samples to fool existing safe rails. Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. Our findings carry significant implications for (semi-)autonomous systems relying on LLMs, highlighting potential challenges in their interaction with existing systems and safety measures.
♻ ☆ On the resilience of Collaborative Learning-based Recommender Systems Against Community Detection Attack
Collaborative-learning-based recommender systems emerged following the success of collaborative learning techniques such as Federated Learning (FL) and Gossip Learning (GL). In these systems, users participate in the training of a recommender system while maintaining their history of consumed items on their devices. While these solutions seemed appealing for preserving the privacy of the participants at first glance, recent studies have revealed that collaborative learning can be vulnerable to various privacy attacks. In this paper, we study the resilience of collaborative learning-based recommender systems against a novel privacy attack called Community Detection Attack (CDA). This attack enables an adversary to identify community members based on a chosen set of items (eg., identifying users interested in specific points-of-interest). Through experiments on three real recommendation datasets using two state-of-the-art recommendation models, we evaluate the sensitivity of an FL-based recommender system as well as two flavors of Gossip Learning-based recommender systems to CDA. The results show that across all models and datasets, the FL setting is more vulnerable to CDA compared to Gossip settings. Furthermore, we assess two off-the-shelf mitigation strategies, namely differential privacy (DP) and a \emph{Share less} policy, which consists of sharing a subset of less sensitive model parameters. The findings indicate a more favorable privacy-utility trade-off for the \emph{Share less} strategy, particularly in FedRecs.
♻ ☆ Multi-agent reinforcement learning using echo-state network and its application to pedestrian dynamics
In recent years, simulations of pedestrians using the multi-agent reinforcement learning (MARL) have been studied. This study considered the roads on a grid-world environment, and implemented pedestrians as MARL agents using an echo-state network and the least squares policy iteration method. Under this environment, the ability of these agents to learn to move forward by avoiding other agents was investigated. Specifically, we considered two types of tasks: the choice between a narrow direct route and a broad detour, and the bidirectional pedestrian flow in a corridor. The simulations results indicated that the learning was successful when the density of the agents was not that high.
comment: 26 pages, 17 figures
♻ ☆ Distributionally Generative Augmentation for Fair Facial Attribute Classification CVPR 2024
Facial Attribute Classification (FAC) holds substantial promise in widespread applications. However, FAC models trained by traditional methodologies can be unfair by exhibiting accuracy inconsistencies across varied data subpopulations. This unfairness is largely attributed to bias in data, where some spurious attributes (e.g., Male) statistically correlate with the target attribute (e.g., Smiling). Most of existing fairness-aware methods rely on the labels of spurious attributes, which may be unavailable in practice. This work proposes a novel, generation-based two-stage framework to train a fair FAC model on biased data without additional annotation. Initially, we identify the potential spurious attributes based on generative models. Notably, it enhances interpretability by explicitly showing the spurious attributes in image space. Following this, for each image, we first edit the spurious attributes with a random degree sampled from a uniform distribution, while keeping target attribute unchanged. Then we train a fair FAC model by fostering model invariance to these augmentation. Extensive experiments on three common datasets demonstrate the effectiveness of our method in promoting fairness in FAC without compromising accuracy. Codes are in https://github.com/heqianpei/DiGA.
comment: CVPR 2024
♻ ☆ Preference as Reward, Maximum Preference Optimization with Importance Sampling
Preference learning is a key technology for aligning language models with human values. Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm to optimize preference learning, which first fits a reward model for preference scores and then optimizes the generating policy with an on-policy PPO algorithm to maximize the reward. The processing of RLHF is complex, time-consuming, and unstable. The Direct Preference Optimization (DPO) algorithm uses an off-policy algorithm to directly optimize the generating policy and eliminates the need for a reward model. DPO is more data-efficient and stable. However, DPO has a drawback of overfitting to the preference data and ignoring the KL-regularization term when the preference is deterministic. Identity mapping Preference Optimization(IPO) uses a root-finding MSE loss to incorporate KL-regularization. However, both DPO and IPO fail to properly address the KL-regularization term because the support of the preference distribution is not equal to the reference distribution. In this paper, we propose a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, which we call Maximum Preference Optimization (MPO). MPO incorporates the off-policy KL-regularization term, making regularization truly effective. MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while being an off-policy algorithm. Furthermore, MPO eliminates the need for a reward model and reference policy, simplifying the learning process and reducing memory usage.
♻ ☆ Data driven modeling for self-similar dynamics
Multiscale modeling of complex systems is crucial for understanding their intricacies. Data-driven multiscale modeling has emerged as a promising approach to tackle challenges associated with complex systems. On the other hand, self-similarity is prevalent in complex systems, hinting that large-scale complex systems can be modeled at a reduced cost. In this paper, we introduce a multiscale neural network framework that incorporates self-similarity as prior knowledge, facilitating the modeling of self-similar dynamical systems. For deterministic dynamics, our framework can discern whether the dynamics are self-similar. For uncertain dynamics, it can compare and determine which parameter set is closer to self-similarity. The framework allows us to extract scale-invariant kernels from the dynamics for modeling at any scale. Moreover, our method can identify the power law exponents in self-similar systems. Preliminary tests on the Ising model yielded critical exponents consistent with theoretical expectations, providing valuable insights for addressing critical phase transitions in non-equilibrium systems.
comment: 10 pages,7 figures,1 table
♻ ☆ I-PHYRE: Interactive Physical Reasoning ICLR 2024
Current evaluation protocols predominantly assess physical reasoning in stationary scenes, creating a gap in evaluating agents' abilities to interact with dynamic events. While contemporary methods allow agents to modify initial scene configurations and observe consequences, they lack the capability to interact with events in real time. To address this, we introduce I-PHYRE, a framework that challenges agents to simultaneously exhibit intuitive physical reasoning, multi-step planning, and in-situ intervention. Here, intuitive physical reasoning refers to a quick, approximate understanding of physics to address complex problems; multi-step denotes the need for extensive sequence planning in I-PHYRE, considering each intervention can significantly alter subsequent choices; and in-situ implies the necessity for timely object manipulation within a scene, where minor timing deviations can result in task failure. We formulate four game splits to scrutinize agents' learning and generalization of essential principles of interactive physical reasoning, fostering learning through interaction with representative scenarios. Our exploration involves three planning strategies and examines several supervised and reinforcement agents' zero-shot generalization proficiency on I-PHYRE. The outcomes highlight a notable gap between existing learning algorithms and human performance, emphasizing the imperative for more research in enhancing agents with interactive physical reasoning capabilities. The environment and baselines will be made publicly available.
comment: 21 pages, ICLR 2024
♻ ☆ Spectral methods for Neural Integral Equations
Neural integral equations are deep learning models based on the theory of integral equations, where the model consists of an integral operator and the corresponding equation (of the second kind) which is learned through an optimization procedure. This approach allows to leverage the nonlocal properties of integral operators in machine learning, but it is computationally expensive. In this article, we introduce a framework for neural integral equations based on spectral methods that allows us to learn an operator in the spectral domain, resulting in a cheaper computational cost, as well as in high interpolation accuracy. We study the properties of our methods and show various theoretical guarantees regarding the approximation capabilities of the model, and convergence to solutions of the numerical methods. We provide numerical experiments to demonstrate the practical effectiveness of the resulting model.
comment: 15 pages, 3 figures and 2 tables. v3: Missing hypotheses for the framework have been now added
♻ ☆ Developing and Deploying Industry Standards for Artificial Intelligence in Education (AIED): Challenges, Strategies, and Future Directions
The adoption of Artificial Intelligence in Education (AIED) holds the promise of revolutionizing educational practices by offering personalized learning experiences, automating administrative and pedagogical tasks, and reducing the cost of content creation. However, the lack of standardized practices in the development and deployment of AIED solutions has led to fragmented ecosystems, which presents challenges in interoperability, scalability, and ethical governance. This article aims to address the critical need to develop and implement industry standards in AIED, offering a comprehensive analysis of the current landscape, challenges, and strategic approaches to overcome these obstacles. We begin by examining the various applications of AIED in various educational settings and identify key areas lacking in standardization, including system interoperability, ontology mapping, data integration, evaluation, and ethical governance. Then, we propose a multi-tiered framework for establishing robust industry standards for AIED. In addition, we discuss methodologies for the iterative development and deployment of standards, incorporating feedback loops from real-world applications to refine and adapt standards over time. The paper also highlights the role of emerging technologies and pedagogical theories in shaping future standards for AIED. Finally, we outline a strategic roadmap for stakeholders to implement these standards, fostering a cohesive and ethical AIED ecosystem. By establishing comprehensive industry standards, such as those by IEEE Artificial Intelligence Standards Committee (AISC) and International Organization for Standardization (ISO), we can accelerate and scale AIED solutions to improve educational outcomes, ensuring that technological advances align with the principles of inclusivity, fairness, and educational excellence.
comment: 12 pages
♻ ☆ SEA: Sparse Linear Attention with Estimated Attention Mask
The transformer architecture has driven breakthroughs in recent years on tasks which require modeling pairwise relationships between sequential elements, as is the case in natural language understanding. However, long seqeuences pose a problem due to the quadratic complexity of the attention operation. Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix. Yet, these approaches cannot straightforwardly distill knowledge from a teacher's attention matrix and often require complete retraining from scratch. Furthermore, previous sparse and linear approaches lose interpretability if they cannot produce full attention matrices. To address these challenges, we propose SEA: Sparse linear attention with an Estimated Attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then subsequently creates a sparse attention matrix with a top-k selection to perform a sparse attention operation. For language modeling tasks (Wikitext2), previous linear and sparse attention methods show roughly two-fold worse perplexity scores over the quadratic OPT-1.3B baseline, while SEA achieves better perplexity than OPT-1.3B, using roughly half the memory of OPT-1.3B, providing interpretable attention matrix. We believe that our work will have a large practical impact, as it opens the possibility of running large transformers on resource-limited devices with less memory.
comment: 9 main pages
♻ ☆ BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network ICASSP 2024
Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.
comment: Accepted at ICASSP 2024. Equation (5) in the previous version is wrong. We modified it
♻ ☆ A Transfer Attack to Image Watermarks
Watermark has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detector against evasion attacks in the white-box and black-box settings is well understood in the literature. However, the robustness in the no-box setting is much less understood. In particular, multiple studies claimed that image watermark is robust in such setting. In this work, we propose a new transfer evasion attack to image watermark in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show that, both theoretically and empirically, watermark-based AI-generated image detector is not robust to evasion attacks even if the attacker does not have access to the watermarking model nor the detection API.
♻ ☆ Don't Judge by the Look: Towards Motion Coherent Video Representation ICLR2024
Current training pipelines in object recognition neglect Hue Jittering when doing data augmentation as it not only brings appearance changes that are detrimental to classification, but also the implementation is inefficient in practice. In this study, we investigate the effect of hue variance in the context of video understanding and find this variance to be beneficial since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video understanding, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances. Concretely, we propose an operation SwapMix to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, enforcing the model to learn appearance invariant representations. Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the application of VA in other augmentation methods. Code is available at https://github.com/BeSpontaneous/MCA-pytorch.
comment: Accepted by ICLR2024
♻ ☆ EVOTER: Evolution of Transparent Explainable Rule-sets
Most AI systems are black boxes generating reasonable outputs for given inputs. Some domains, however, have explainability and trustworthiness requirements that cannot be directly met by these approaches. Various methods have therefore been developed to interpret black-box models after training. This paper advocates an alternative approach where the models are transparent and explainable to begin with. This approach, EVOTER, evolves rule-sets based on simple logical expressions. The approach is evaluated in several prediction/classification and prescription/policy search domains with and without a surrogate. It is shown to discover meaningful rule sets that perform similarly to black-box models. The rules can provide insight into the domain, and make biases hidden in the data explicit. It may also be possible to edit them directly to remove biases and add constraints. EVOTER thus forms a promising foundation for building trustworthy AI systems for real-world applications in the future.
♻ ☆ Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning
Large-scale Text-to-Image (TTI) models have become a common approach for generating training data in various generative fields. However, visual hallucinations, which contain perceptually critical defects, remain a concern, especially in non-photorealistic styles like cartoon characters. We propose a novel visual hallucination detection system for cartoon character images generated by TTI models. Our approach leverages pose-aware in-context visual learning (PA-ICVL) with Vision-Language Models (VLMs), utilizing both RGB images and pose information. By incorporating pose guidance from a fine-tuned pose estimator, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. This research advances TTI models by mitigating visual hallucinations, expanding their potential in non-photorealistic domains.
comment: 11 pages, 12 figures, 1 table, Project page: https://gh-bumsookim.github.io/Cartoon-Hallucinations-Detection/
♻ ☆ Counterfactual Learning on Graphs: A Survey
Graph-structured data are pervasive in the real-world such as social networks, molecular graphs and transaction networks. Graph neural networks (GNNs) have achieved great success in representation learning on graphs, facilitating various downstream tasks. However, GNNs have several drawbacks such as lacking interpretability, can easily inherit the bias of data and cannot model casual relations. Recently, counterfactual learning on graphs has shown promising results in alleviating these drawbacks. Various approaches have been proposed for counterfactual fairness, explainability, link prediction and other applications on graphs. To facilitate the development of this promising direction, in this survey, we categorize and comprehensively review papers on graph counterfactual learning. We divide existing methods into four categories based on problems studied. For each category, we provide background and motivating examples, a general framework summarizing existing works and a detailed review of these works. We point out promising future research directions at the intersection of graph-structured data, counterfactual learning, and real-world applications. To offer a comprehensive view of resources for future studies, we compile a collection of open-source implementations, public datasets, and commonly-used evaluation metrics. This survey aims to serve as a ``one-stop-shop'' for building a unified understanding of graph counterfactual learning categories and current resources. We also maintain a repository for papers and resources and will keep updating the repository https://github.com/TimeLovercc/Awesome-Graph-Causal-Learning.
♻ ☆ Universality of almost periodicity in bounded discrete time series
We consider arbitrary bounded discrete time series. From its statistical feature, without any use of the Fourier transform, we find an almost periodic function which suitably characterizes the corresponding time series.
♻ ☆ Masked Vector Quantization
Generative models with discrete latent representations have recently demonstrated an impressive ability to learn complex high-dimensional data distributions. However, their performance relies on a long sequence of tokens per instance and a large number of codebook entries, resulting in long sampling times and considerable computation to fit the categorical posterior. To address these issues, we propose the Masked Vector Quantization (MVQ) framework which increases the representational capacity of each code vector by learning mask configurations via a stochastic winner-takes-all training regime called Multiple Hypothese Dropout (MH-Dropout). On ImageNet 64$\times$64, MVQ reduces FID in existing vector quantization architectures by up to $68\%$ at 2 tokens per instance and $57\%$ at 5 tokens. These improvements widen as codebook entries is reduced and allows for $7\textit{--}45\times$ speed-up in token sampling during inference. As an additional benefit, we find that smaller latent spaces lead to MVQ identifying transferable visual representations where multiple can be smoothly combined.
comment: A newer version of this manuscript was archived under 2312.11735
PhyloGFN: Phylogenetic inference with generative flow networks
Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods. Our code is available at https://github.com/zmy1116/phylogfn.
Cryptography and Security
☆ Real-Valued Somewhat-Pseudorandom Unitaries
We explore a very simple distribution of unitaries: random (binary) phase -- Hadamard -- random (binary) phase -- random computational-basis permutation. We show that this distribution is statistically indistinguishable from random Haar unitaries for any polynomial set of orthogonal input states (in any basis) with polynomial multiplicity. This shows that even though real-valued unitaries cannot be completely pseudorandom (Haug, Bharti, Koh, arXiv:2306.11677), we can still obtain some pseudorandom properties without giving up on the simplicity of a real-valued unitary. Our analysis shows that an even simpler construction: applying a random (binary) phase followed by a random computational-basis permutation, would suffice, assuming that the input is orthogonal and \emph{flat} (that is, has high min-entropy when measured in the computational basis). Using quantum-secure one-way functions (which imply quantum-secure pseudorandom functions and permutations), we obtain an efficient cryptographic instantiation of the above.
☆ AI-Generated Video Detection via Spatio-Temporal Anomaly Learning
The advancement of generation models has led to the emergence of highly realistic artificial intelligence (AI)-generated videos. Malicious users can easily create non-existent videos to spread false information. This letter proposes an effective AI-generated video detection (AIGVDet) scheme by capturing the forensic traces with a two-branch spatio-temporal convolutional neural network (CNN). Specifically, two ResNet sub-detectors are learned separately for identifying the anomalies in spatical and optical flow domains, respectively. Results of such sub-detectors are fused to further enhance the discrimination ability. A large-scale generated video dataset (GVD) is constructed as a benchmark for model training and evaluation. Extensive experimental results verify the high generalization and robustness of our AIGVDet scheme. Code and dataset will be available at https://github.com/multimediaFor/AIGVDet.
☆ Distributed collaborative anomalous sound detection by embedding sharing
To develop a machine sound monitoring system, a method for detecting anomalous sound is proposed. In this paper, we explore a method for multiple clients to collaboratively learn an anomalous sound detection model while keeping their raw data private from each other. In the context of industrial machine anomalous sound detection, each client possesses data from different machines or different operational states, making it challenging to learn through federated learning or split learning. In our proposed method, each client calculates embeddings using a common pre-trained model developed for sound data classification, and these calculated embeddings are aggregated on the server to perform anomalous sound detection through outlier exposure. Experiments showed that our proposed method improves the AUC of anomalous sound detection by an average of 6.8%.
☆ Differentially Private Online Federated Learning with Correlated Noise
We propose a novel differentially private algorithm for online federated learning that employs temporally correlated noise to improve the utility while ensuring the privacy of the continuously released models. To address challenges stemming from DP noise and local updates with streaming noniid data, we develop a perturbed iterate analysis to control the impact of the DP noise on the utility. Moreover, we demonstrate how the drift errors from local updates can be effectively managed under a quasi-strong convexity condition. Subject to an $(\epsilon, \delta)$-DP budget, we establish a dynamic regret bound over the entire time horizon that quantifies the impact of key parameters and the intensity of changes in dynamic environments. Numerical experiments validate the efficacy of the proposed algorithm.
comment: 11 pages
☆ Let Real Images be as a Judger, Spotting Fake Images Synthesized with Generative Models
In the last few years, generative models have shown their powerful capabilities in synthesizing realistic images in both quality and diversity (i.e., facial images, and natural subjects). Unfortunately, the artifact patterns in fake images synthesized by different generative models are inconsistent, leading to the failure of previous research that relied on spotting subtle differences between real and fake. In our preliminary experiments, we find that the artifacts in fake images always change with the development of the generative model, while natural images exhibit stable statistical properties. In this paper, we employ natural traces shared only by real images as an additional predictive target in the detector. Specifically, the natural traces are learned from the wild real images and we introduce extended supervised contrastive learning to bring them closer to real images and further away from fake ones. This motivates the detector to make decisions based on the proximity of images to the natural traces. To conduct a comprehensive experiment, we built a high-quality and diverse dataset that includes generative models comprising 6 GAN and 6 diffusion models, to evaluate the effectiveness in generalizing unknown forgery techniques and robustness in surviving different transformations. Experimental results show that our proposed method gives 96.1% mAP significantly outperforms the baselines. Extensive experiments conducted on the widely recognized platform Midjourney reveal that our proposed method achieves an accuracy exceeding 78.4%, underscoring its practicality for real-world application deployment. The source code and partial self-built dataset are available in supplementary material.
☆ Plaintext-Free Deep Learning for Privacy-Preserving Medical Image Analysis via Frequency Information Embedding
In the fast-evolving field of medical image analysis, Deep Learning (DL)-based methods have achieved tremendous success. However, these methods require plaintext data for training and inference stages, raising privacy concerns, especially in the sensitive area of medical data. To tackle these concerns, this paper proposes a novel framework that uses surrogate images for analysis, eliminating the need for plaintext images. This approach is called Frequency-domain Exchange Style Fusion (FESF). The framework includes two main components: Image Hidden Module (IHM) and Image Quality Enhancement Module~(IQEM). The~IHM performs in the frequency domain, blending the features of plaintext medical images into host medical images, and then combines this with IQEM to improve and create surrogate images effectively. During the diagnostic model training process, only surrogate images are used, enabling anonymous analysis without any plaintext data during both training and inference stages. Extensive evaluations demonstrate that our framework effectively preserves the privacy of medical images and maintains diagnostic accuracy of DL models at a relatively high level, proving its effectiveness across various datasets and DL-based models.
☆ Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion
Modern neural networks are often trained on massive datasets that are web scraped with minimal human inspection. As a result of this insecure curation pipeline, an adversary can poison or backdoor the resulting model by uploading malicious data to the internet and waiting for a victim to scrape and train on it. Existing approaches for creating poisons and backdoors start with randomly sampled clean data, called base samples, and then modify those samples to craft poisons. However, some base samples may be significantly more amenable to poisoning than others. As a result, we may be able to craft more potent poisons by carefully choosing the base samples. In this work, we use guided diffusion to synthesize base samples from scratch that lead to significantly more potent poisons and backdoors than previous state-of-the-art attacks. Our Guided Diffusion Poisoning (GDP) base samples can be combined with any downstream poisoning or backdoor attack to boost its effectiveness. Our implementation code is publicly available at: https://github.com/hsouri/GDP .
☆ Machine learning for moduli space of genus two curves and an application to post-quantum cryptography
We use machine learning to study the locus ${\mathcal L}_n$ of genus two curves with $(n, n)$-split Jacobian. More precisely we design a transformer model which given values for the Igusa invariants determines if the corresponding genus two curve is in the locus ${\mathcal L}_n$, for $n=2, 3, 5, 7$. Such curves are important in isogeny based cryptography. During this study we discover that there are no rational points ${\mathfrak p} \in {\mathcal L}_n$ with weighted moduli height $\leq 2$ in any of ${\mathcal L}_2$, ${\mathcal L}_3$, and ${\mathcal L}_5$. This extends on previous work of the authors to use machine learning methods to study the moduli space of genus 2 algebraic curves.
comment: This is a short summary of extensive computations done in the moduli space of genus 2 curves with focus on curves with (n,n)-split Jacobians
☆ Measuring Compliance with the California Consumer Privacy Act Over Space and Time
The widespread sharing of consumers personal information with third parties raises significant privacy concerns. The California Consumer Privacy Act (CCPA) mandates that online businesses offer consumers the option to opt out of the sale and sharing of personal information. Our study automatically tracks the presence of the opt-out link longitudinally across multiple states after the California Privacy Rights Act (CPRA) went into effect. We categorize websites based on whether they are subject to CCPA and investigate cases of potential non-compliance. We find a number of websites that implement the opt-out link early and across all examined states but also find a significant number of CCPA-subject websites that fail to offer any opt-out methods even when CCPA is in effect. Our findings can shed light on how websites are reacting to the CCPA and identify potential gaps in compliance and opt- out method designs that hinder consumers from exercising CCPA opt-out rights.
☆ A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection
Large Language Models (LLMs) have demonstrated great potential for code generation and other software engineering tasks. Vulnerability detection is of crucial importance to maintaining the security, integrity, and trustworthiness of software systems. Precise vulnerability detection requires reasoning about the code, making it a good case study for exploring the limits of LLMs' reasoning capabilities. Although recent work has applied LLMs to vulnerability detection using generic prompting techniques, their full capabilities for this task and the types of errors they make when explaining identified vulnerabilities remain unclear. In this paper, we surveyed eleven LLMs that are state-of-the-art in code generation and commonly used as coding assistants, and evaluated their capabilities for vulnerability detection. We systematically searched for the best-performing prompts, incorporating techniques such as in-context learning and chain-of-thought, and proposed three of our own prompting methods. Our results show that while our prompting methods improved the models' performance, LLMs generally struggled with vulnerability detection. They reported 0.5-0.63 Balanced Accuracy and failed to distinguish between buggy and fixed versions of programs in 76% of cases on average. By comprehensively analyzing and categorizing 287 instances of model reasoning, we found that 57% of LLM responses contained errors, and the models frequently predicted incorrect locations of buggy code and misidentified bug types. LLMs only correctly localized 6 out of 27 bugs in DbgBench, and these 6 bugs were predicted correctly by 70-100% of human participants. These findings suggest that despite their potential for other tasks, LLMs may fail to properly comprehend critical code structures and security-related concepts. Our data and code are available at https://figshare.com/s/78fe02e56e09ec49300b.
☆ LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning CVPR 2024
Backdoor attack poses a significant security threat to Deep Learning applications. Existing attacks are often not evasive to established backdoor detection techniques. This susceptibility primarily stems from the fact that these attacks typically leverage a universal trigger pattern or transformation function, such that the trigger can cause misclassification for any input. In response to this, recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions. While these approaches manage to evade detection to some extent, they reveal vulnerability to existing backdoor mitigation techniques. To address and enhance both evasiveness and resilience, we introduce a novel backdoor attack LOTUS. Specifically, it leverages a secret function to separate samples in the victim class into a set of partitions and applies unique triggers to different partitions. Furthermore, LOTUS incorporates an effective trigger focusing mechanism, ensuring only the trigger corresponding to the partition can induce the backdoor behavior. Extensive experimental results show that LOTUS can achieve high attack success rate across 4 datasets and 7 model structures, and effectively evading 13 backdoor detection and mitigation techniques. The code is available at https://github.com/Megum1/LOTUS.
comment: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)
☆ CYGENT: A cybersecurity conversational agent with log summarization powered by GPT-3
In response to the escalating cyber-attacks in the modern IT and IoT landscape, we developed CYGENT, a conversational agent framework powered by GPT-3.5 turbo model, designed to aid system administrators in ensuring optimal performance and uninterrupted resource availability. This study focuses on fine-tuning GPT-3 models for cybersecurity tasks, including conversational AI and generative AI tailored specifically for cybersecurity operations. CYGENT assists users by providing cybersecurity information, analyzing and summarizing uploaded log files, detecting specific events, and delivering essential instructions. The conversational agent was developed based on the GPT-3.5 turbo model. We fine-tuned and validated summarizer models (GPT3) using manually generated data points. Using this approach, we achieved a BERTscore of over 97%, indicating GPT-3's enhanced capability in summarizing log files into human-readable formats and providing necessary information to users. Furthermore, we conducted a comparative analysis of GPT-3 models with other Large Language Models (LLMs), including CodeT5-small, CodeT5-base, and CodeT5-base-multi-sum, with the objective of analyzing log analysis techniques. Our analysis consistently demonstrated that Davinci (GPT-3) model outperformed all other LLMs, showcasing higher performance. These findings are crucial for improving human comprehension of logs, particularly in light of the increasing numbers of IoT devices. Additionally, our research suggests that the CodeT5-base-multi-sum model exhibits comparable performance to Davinci to some extent in summarizing logs, indicating its potential as an offline model for this task.
comment: 7 pages, 9 figures
☆ Task-Agnostic Detector for Insertion-Based Backdoor Attacks NAACL 2024
Textual backdoor attacks pose significant security threats. Current detection approaches, typically relying on intermediate feature representation or reconstructing potential triggers, are task-specific and less effective beyond sentence classification, struggling with tasks like question answering and named entity recognition. We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection. TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks. TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods.
comment: Findings of NAACL 2024
☆ Stochastic Gradient Langevin Unlearning
``The right to be forgotten'' ensured by laws for user data privacy becomes increasingly important. Machine unlearning aims to efficiently remove the effect of certain data points on the trained model parameters so that it can be approximately the same as if one retrains the model from scratch. This work proposes stochastic gradient Langevin unlearning, the first unlearning framework based on noisy stochastic gradient descent (SGD) with privacy guarantees for approximate unlearning problems under convexity assumption. Our results show that mini-batch gradient updates provide a superior privacy-complexity trade-off compared to the full-batch counterpart. There are numerous algorithmic benefits of our unlearning approach, including complexity saving compared to retraining, and supporting sequential and batch unlearning. To examine the privacy-utility-complexity trade-off of our method, we conduct experiments on benchmark datasets compared against prior works. Our approach achieves a similar utility under the same privacy constraint while using $2\%$ and $10\%$ of the gradient computations compared with the state-of-the-art gradient-based approximate unlearning methods for mini-batch and full-batch settings, respectively.
comment: arXiv admin note: substantial text overlap with arXiv:2401.10371
☆ Machine Learning on Blockchain Data: A Systematic Mapping Study
Context: Blockchain technology has drawn growing attention in the literature and in practice. Blockchain technology generates considerable amounts of data and has thus been a topic of interest for Machine Learning (ML). Objective: The objective of this paper is to provide a comprehensive review of the state of the art on machine learning applied to blockchain data. This work aims to systematically identify, analyze, and classify the literature on ML applied to blockchain data. This will allow us to discover the fields where more effort should be placed in future research. Method: A systematic mapping study has been conducted to identify the relevant literature. Ultimately, 159 articles were selected and classified according to various dimensions, specifically, the domain use case, the blockchain, the data, and the machine learning models. Results: The majority of the papers (49.7%) fall within the Anomaly use case. Bitcoin (47.2%) was the blockchain that drew the most attention. A dataset consisting of more than 1.000.000 data points was used by 31.4% of the papers. And Classification (46.5%) was the ML task most applied to blockchain data. Conclusion: The results confirm that ML applied to blockchain data is a relevant and a growing topic of interest both in the literature and in practice. Nevertheless, some open challenges and gaps remain, which can lead to future research directions. Specifically, we identify novel machine learning algorithms, the lack of a standardization framework, blockchain scalability issues and cross-chain interactions as areas worth exploring in the future.
☆ Semantic Ranking for Automated Adversarial Technique Annotation in Security Text
We introduce a new method for extracting structured threat behaviors from threat intelligence text. Our method is based on a multi-stage ranking architecture that allows jointly optimizing for efficiency and effectiveness. Therefore, we believe this problem formulation better aligns with the real-world nature of the task considering the large number of adversary techniques and the extensive body of threat intelligence created by security analysts. Our findings show that the proposed system yields state-of-the-art performance results for this task. Results show that our method has a top-3 recall performance of 81\% in identifying the relevant technique among 193 top-level techniques. Our tests also demonstrate that our system performs significantly better (+40\%) than the widely used large language models when tested under a zero-shot setting.
☆ Bayesian Methods for Trust in Collaborative Multi-Agent Autonomy
Multi-agent, collaborative sensor fusion is a vital component of a multi-national intelligence toolkit. In safety-critical and/or contested environments, adversaries may infiltrate and compromise a number of agents. We analyze state of the art multi-target tracking algorithms under this compromised agent threat model. We prove that the track existence probability test ("track score") is significantly vulnerable to even small numbers of adversaries. To add security awareness, we design a trust estimation framework using hierarchical Bayesian updating. Our framework builds beliefs of trust on tracks and agents by mapping sensor measurements to trust pseudomeasurements (PSMs) and incorporating prior trust beliefs in a Bayesian context. In case studies, our trust estimation algorithm accurately estimates the trustworthiness of tracks/agents, subject to observability limitations.
☆ Multi-Agent Optimization for Safety Analysis of Cyber-Physical Systems: Position Paper
Failure Mode, Effects and Criticality Analysis (FMECA) is one of the safety analysis methods recommended by most of the international standards. The classical FMECA is made in a form of a table filled in either manually or by using safety analysis tools. In both cases, the design engineers have to choose the trade-offs between safety and other development constraints. In the case of complex cyber-physical systems (CPS) with thousands of specified constraints, this may lead to severe problems and significantly impact the overall criticality of CPS. In this paper, we propose to adopt optimization techniques to automate the decision making process conducted after FMECA of CPS. We describe a multi-agent based optimization method which extends classical FMECA for offering optimal solutions in terms of criticality and development constraints of CPS.
comment: 13 pages, 2 figures, 1 table, "2nd International Workshop on Emerging Ideas and Trends in Engineering of Cyber-Physical Systems, part of Cyber-Physical Systems Week, April 2015, Seattle, USA"
☆ Towards Secure and Trusted-by-Design Smart Contracts
Distributed immutable ledgers, or blockchains, allow the secure digitization of evidential transactions without relying on a trusted third-party. Evidential transactions involve the exchange of any form of physical evidence, such as money, birth certificate, visas, tickets, etc. Most of the time, evidential transactions occur in the context of complex procedures, called evidential protocols, among physical agents. The blockchain provides the mechanisms to transfer evidence, while smart contracts - programs executing within the blockchain in a decentralized and replicated fashion - allow encoding evidential protocols on top of a blockchain. As a smart contract foregoes trusted third-parties and runs on several machines anonymously, it constitutes a highly critical program that has to be secure and trusted-by-design. While most of the current smart contract languages focus on easy programmability, they do not directly address the need of guaranteeing trust and accountability, which becomes a significant issue when evidential protocols are encoded as smart contracts.
comment: 17 pages, 1 algorithm, The 29th Francophone Days of Application Languages - JFLA 2018
☆ Concerned with Data Contamination? Assessing Countermeasures in Code Language Model
Various techniques have been proposed to leverage the capabilities of code language models (CLMs) for SE tasks. While these techniques typically evaluate their effectiveness using publicly available datasets, the evaluation can be subject to data contamination threats where the evaluation datasets have already been used to train the concerned CLMs. This can significantly affect the reliability of the evaluation. Different countermeasures have been suggested to mitigate the data contamination threat. Countermeasures include using more recent data, curating new data, and refactoring existing data are introduced, yet it is unclear whether these countermeasures could really mitigate data contamination threats to model evaluation. To fill the gap, we systematically study to quantify the impacts of these countermeasures on CLMs' performance. To facilitate the study, we collected over 2 million Python functions with timestamps ranging from January 1st, 2018, to December 31st, 2023. The data created before the models' cut-off date are considered "contaminated data", while the data where the countermeasures are taken are regarded as "cleansed data". We study the impact of these countermeasures by investigating the difference in CLMs' performance on contaminated and cleansed data derived from different countermeasures. Our experiments yield several interesting observations. For instance, CLMs do not necessarily perform worse on data after the models' cut-off date; on the contrary, they sometimes perform better. In addition, refactoring did not always result in decreased performance; it could lead to improvements instead. Furthermore, existing metrics such as perplexity cannot distinguish contaminated/cleansed data. We hope that the results and observations could help deepen the understanding of CLMs' capabilities and inform the community about data contamination.
☆ CipherFormer: Efficient Transformer Private Inference with Low Round Complexity SC
There is a growing trend to outsource the inference task of large transformer models to cloud servers. However, this poses a severe threat to users' private data as they are exposed to cloud servers after uploading. Although several works attempted to provide private inference for transformer models, their hundreds of communication rounds limit the application scenarios. Motivated by the desire to minimize round complexity, we propose CipherFormer, a novel transformer private inference scheme using homomorphic encryption and garbled circuits. We present a protocol for quickly computing homomorphic matrix multiplications. We then modify the attention mechanism and design the corresponding garbled circuits. Furthermore, we show how to use a lightweight attention mechanism and mixed-bitwidth to reduce the inference latency while maintaining accuracy. In comparison with an advanced homomorphic encryption scheme on text classification tasks, our model improves accuracy by 3% to 11% while performing private inference with a 7.7x-11.9x speedup.
comment: Accepted by CSCWD 2024 (27th International Conference on Computer Supported Cooperative Work in Design)
☆ Bi-objective Optimization in Role Mining
Role mining is a technique used to derive a role-based authorization policy from an existing policy. Given a set of users $U$, a set of permissions $P$ and a user-permission authorization relation $\mahtit{UPA}\subseteq U\times P$, a role mining algorithm seeks to compute a set of roles $R$, a user-role authorization relation $\mathit{UA}\subseteq U\times R$ and a permission-role authorization relation $\mathit{PA}\subseteq R\times P$, such that the composition of $\mathit{UA}$ and $\mathit{PA}$ is close (in some appropriate sense) to $\mathit{UPA}$. In this paper, we first introduce the Generalized Noise Role Mining problem (GNRM) -- a generalization of the MinNoise Role Mining problem -- which we believe has considerable practical relevance. Extending work of Fomin et al., we show that GNRM is fixed parameter tractable, with parameter $r + k$, where $r$ is the number of roles in the solution and $k$ is the number of discrepancies between $\mathit{UPA}$ and the relation defined by the composition of $\mathit{UA}$ and $\mathit{PA}$. We further introduce a bi-objective optimization variant of GNRM, where we wish to minimize both $r$ and $k$ subject to upper bounds $r\le \bar{r}$ and $k\le \bar{k}$, where $\bar{r}$ and $\bar{k}$ are constants. We show that the Pareto front of this bi-objective optimization problem (BO-GNRM) can be computed in fixed-parameter tractable time with parameter $\bar{r}+\bar{k}$. We then report the results of our experimental work using the integer programming solver Gurobi to solve instances of BO-GNRM. Our key findings are that (a) we obtained strong support that Gurobi's performance is fixed-parameter tractable, (b) our results suggest that our techniques may be useful for role mining in practice, based on our experiments in the context of three well-known real-world authorization policies.
☆ Deciphering the Interplay between Local Differential Privacy, Average Bayesian Privacy, and Maximum Bayesian Privacy
The swift evolution of machine learning has led to emergence of various definitions of privacy due to the threats it poses to privacy, including the concept of local differential privacy (LDP). Although widely embraced and utilized across numerous domains, this conventional approach to measure privacy still exhibits certain limitations, spanning from failure to prevent inferential disclosure to lack of consideration for the adversary's background knowledge. In this comprehensive study, we introduce Bayesian privacy and delve into the intricate relationship between local differential privacy and its Bayesian counterparts, unveiling novel insights into utility-privacy trade-offs. We introduce a framework that encapsulates both attack and defense strategies, highlighting their interplay and effectiveness. Our theoretical contributions are anchored in the rigorous definitions and relationships between Average Bayesian Privacy (ABP) and Maximum Bayesian Privacy (MBP), encapsulated by equations $\epsilon_{p,a} \leq \frac{1}{\sqrt{2}}\sqrt{(\epsilon_{p,m} + \epsilon)\cdot(e^{\epsilon_{p,m} + \epsilon} - 1)}$ and the equivalence between $\xi$-MBP and $2\xi$-LDP established under uniform prior distribution. These relationships fortify our understanding of the privacy guarantees provided by various mechanisms, leading to the realization that a mechanism satisfying $\xi$-LDP also confers $\xi$-MBP, and vice versa. Our work not only lays the groundwork for future empirical exploration but also promises to enhance the design of privacy-preserving algorithms that do not compromise on utility, thereby fostering the development of trustworthy machine learning solutions.
☆ Revealing Vulnerabilities of Neural Networks in Parameter Learning and Defense Against Explanation-Aware Backdoors
Explainable Artificial Intelligence (XAI) strategies play a crucial part in increasing the understanding and trustworthiness of neural networks. Nonetheless, these techniques could potentially generate misleading explanations. Blinding attacks can drastically alter a machine learning algorithm's prediction and explanation, providing misleading information by adding visually unnoticeable artifacts into the input, while maintaining the model's accuracy. It poses a serious challenge in ensuring the reliability of XAI methods. To ensure the reliability of XAI methods poses a real challenge, we leverage statistical analysis to highlight the changes in CNN weights within a CNN following blinding attacks. We introduce a method specifically designed to limit the effectiveness of such attacks during the evaluation phase, avoiding the need for extra training. The method we suggest defences against most modern explanation-aware adversarial attacks, achieving an approximate decrease of ~99\% in the Attack Success Rate (ASR) and a ~91\% reduction in the Mean Square Error (MSE) between the original explanation and the defended (post-attack) explanation across three unique types of attacks.
☆ Ensemble Adversarial Defense via Integration of Multiple Dispersed Low Curvature Models IJCNN
The integration of an ensemble of deep learning models has been extensively explored to enhance defense against adversarial attacks. The diversity among sub-models increases the attack cost required to deceive the majority of the ensemble, thereby improving the adversarial robustness. While existing approaches mainly center on increasing diversity in feature representations or dispersion of first-order gradients with respect to input, the limited correlation between these diversity metrics and adversarial robustness constrains the performance of ensemble adversarial defense. In this work, we aim to enhance ensemble diversity by reducing attack transferability. We identify second-order gradients, which depict the loss curvature, as a key factor in adversarial robustness. Computing the Hessian matrix involved in second-order gradients is computationally expensive. To address this, we approximate the Hessian-vector product using differential approximation. Given that low curvature provides better robustness, our ensemble model was designed to consider the influence of curvature among different sub-models. We introduce a novel regularizer to train multiple more-diverse low-curvature network models. Extensive experiments across various datasets demonstrate that our ensemble model exhibits superior robustness against a range of attacks, underscoring the effectiveness of our approach.
comment: Accepted to The 2024 International Joint Conference on Neural Networks (IJCNN)
♻ ☆ Enabling Physical Localization of Uncooperative Cellular Devices
In cellular networks, it can become necessary for authorities to physically locate user devices for tracking criminals or illegal devices. While cellular operators can provide authorities with cell information the device is camping on, fine-grained localization is still required. Therefore, the authorized agents trace the device by monitoring its uplink signals. However, tracking the uplink signal source without its cooperation is challenging even for operators and authorities. Particularly, three challenges remain for fine-grained localization: i) localization works only if devices generate enough uplink traffic reliably over time, ii) the target device might generate its uplink traffic with significantly low power, and iii) cellular repeater may add too much noise to true uplink signals. While these challenges present practical hurdles for localization, they have been overlooked in prior works. In this work, we investigate the impact of these real-world challenges on cellular localization and propose an Uncooperative Multiangulation Attack (UMA) that addresses these challenges. UMA can 1) force a target device to transmit traffic continuously, 2) boost the target's signal strength to the maximum, and 3) uniquely distinguish traffic from the target and the repeaters. Notably, the UMA technique works without privilege on cellular operators or user devices, which makes it operate on any LTE network. Our evaluations show that UMA effectively resolves the challenges in real-world environments when devices are not cooperative for localization. Our approach exploits the current cellular design vulnerabilities, which we have responsibly disclosed to GSMA.
♻ ☆ To share or not to share: What risks would laypeople accept to give sensitive data to differentially-private NLP systems? LREC
Although the NLP community has adopted central differential privacy as a go-to framework for privacy-preserving model training or data sharing, the choice and interpretation of the key parameter, privacy budget $\varepsilon$ that governs the strength of privacy protection, remains largely arbitrary. We argue that determining the $\varepsilon$ value should not be solely in the hands of researchers or system developers, but must also take into account the actual people who share their potentially sensitive data. In other words: Would you share your instant messages for $\varepsilon$ of 10? We address this research gap by designing, implementing, and conducting a behavioral experiment (311 lay participants) to study the behavior of people in uncertain decision-making situations with respect to privacy-threatening situations. Framing the risk perception in terms of two realistic NLP scenarios and using a vignette behavioral study help us determine what $\varepsilon$ thresholds would lead lay people to be willing to share sensitive textual data - to our knowledge, the first study of its kind.
comment: Accepted at LREC-COLING 2024; final camera-ready version
♻ ☆ On the resilience of Collaborative Learning-based Recommender Systems Against Community Detection Attack
Collaborative-learning-based recommender systems emerged following the success of collaborative learning techniques such as Federated Learning (FL) and Gossip Learning (GL). In these systems, users participate in the training of a recommender system while maintaining their history of consumed items on their devices. While these solutions seemed appealing for preserving the privacy of the participants at first glance, recent studies have revealed that collaborative learning can be vulnerable to various privacy attacks. In this paper, we study the resilience of collaborative learning-based recommender systems against a novel privacy attack called Community Detection Attack (CDA). This attack enables an adversary to identify community members based on a chosen set of items (eg., identifying users interested in specific points-of-interest). Through experiments on three real recommendation datasets using two state-of-the-art recommendation models, we evaluate the sensitivity of an FL-based recommender system as well as two flavors of Gossip Learning-based recommender systems to CDA. The results show that across all models and datasets, the FL setting is more vulnerable to CDA compared to Gossip settings. Furthermore, we assess two off-the-shelf mitigation strategies, namely differential privacy (DP) and a \emph{Share less} policy, which consists of sharing a subset of less sensitive model parameters. The findings indicate a more favorable privacy-utility trade-off for the \emph{Share less} strategy, particularly in FedRecs.
♻ ☆ A Transfer Attack to Image Watermarks
Watermark has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detector against evasion attacks in the white-box and black-box settings is well understood in the literature. However, the robustness in the no-box setting is much less understood. In particular, multiple studies claimed that image watermark is robust in such setting. In this work, we propose a new transfer evasion attack to image watermark in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show that, both theoretically and empirically, watermark-based AI-generated image detector is not robust to evasion attacks even if the attacker does not have access to the watermarking model nor the detection API.
♻ ☆ Deep fused flow and topology features for botnet detection basing on pretrained GCN
Nowadays, botnets have become one of the major threats to cyber security. The characteristics of botnets are mainly reflected in bots network behavior and their intercommunication relationships. Existing botnet detection methods use flow features or topology features individually, which overlook the other type of feature. This affects model performance. In this paper, we propose a botnet detection model which uses graph convolutional network (GCN) to deeply fuse flow features and topology features for the first time. We construct communication graphs from network traffic and represent nodes with flow features. Due to the imbalance of existing public traffic flow datasets, it is impossible to train a GCN model on these datasets. Therefore, we use a balanced public communication graph dataset to pretrain a GCN model, thereby guaranteeing its capacity for identify topology features. We then feed the communication graph with flow features into the pretrained GCN. The output from the last hidden layer is treated as the fusion of flow and topology features. Additionally, by adjusting the number of layers in the GCN network, the model can effectively detect botnets under both C2 and P2P structures. Validated on the public ISCX2014 dataset, our approach achieves a remarkable recall rate 92.90% and F1-score 92.76% for C2 botnets, alongside recall rate 94.66% and F1-score of 92.35% for P2P botnets. These results not only demonstrate the effectiveness of our method, but also outperform the performance of the currently leading detection models.
♻ ☆ MMA-Diffusion: MultiModal Attack on Diffusion Models CVPR 2024
In recent years, Text-to-Image (T2I) models have seen remarkable advancements, gaining widespread adoption. However, this progress has inadvertently opened avenues for potential misuse, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces MMA-Diffusion, a framework that presents a significant and realistic threat to the security of T2I models by effectively circumventing current defensive measures in both open-source models and commercial online services. Unlike previous approaches, MMA-Diffusion leverages both textual and visual modalities to bypass safeguards like prompt filters and post-hoc safety checkers, thus exposing and highlighting the vulnerabilities in existing defense mechanisms.
comment: CVPR 2024. Code is available at https://github.com/yangyijune/MMA-Diffusion
♻ ☆ Comprehensive evaluation of Mal-API-2019 dataset by machine learning in malware detection
This study conducts a thorough examination of malware detection using machine learning techniques, focusing on the evaluation of various classification models using the Mal-API-2019 dataset. The aim is to advance cybersecurity capabilities by identifying and mitigating threats more effectively. Both ensemble and non-ensemble machine learning methods, such as Random Forest, XGBoost, K Nearest Neighbor (KNN), and Neural Networks, are explored. Special emphasis is placed on the importance of data pre-processing techniques, particularly TF-IDF representation and Principal Component Analysis, in improving model performance. Results indicate that ensemble methods, particularly Random Forest and XGBoost, exhibit superior accuracy, precision, and recall compared to others, highlighting their effectiveness in malware detection. The paper also discusses limitations and potential future directions, emphasizing the need for continuous adaptation to address the evolving nature of malware. This research contributes to ongoing discussions in cybersecurity and provides practical insights for developing more robust malware detection systems in the digital era.
♻ ☆ InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.
comment: 28 pages, 5 figures, 9 tables
♻ ☆ Blockchain-Envisioned Post-Quantum Secure Sanitizable Signature for Audit Logs Management
Audit logs are one of the most important tools for transparently tracking system events and maintaining continuous oversight in corporate organizations and enterprise business systems. There are many cases where the audit logs contain sensitive data, or the audit logs are enormous. In these situations, dealing with a subset of the data is more practical than the entire data set. To provide a secure solution to handle these issues, a sanitizable signature scheme (SSS) is a viable cryptographic primitive. Herein, we first present the first post-quantum secure multivariate-based SSS, namely Mul-SAN. Our proposed design provides unforgeability, privacy, immutability, signer accountability, and sanitizer accountability under the assumption that the MQ problem is NP-hard. Mul-SAN is very efficient and only requires computing field multiplications and additions over a finite field for its implementation. Mul-SAN presents itself as a practical method to partially delegate control of the authenticated data in avenues like the healthcare industry and government organizations. We also explore using Blockchain to provide a tamper-proof and robust audit log mechanism.
♻ ☆ RSA+: An RSA variant
We introduce a new probabilistic public-key cryptosystem which combines the main ingredients of the well-known RSA and Rabin cryptosystems. We investigate the security and performance of our new scheme in comparison to the other two.
comment: 11 pages, no figures; code on GitHub (https://github.com/soeren-kleine/RSA_plus-an-RSA-variant)
Computation and Language
☆ ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search LREC
Retrieval-based code question answering seeks to match user queries in natural language to relevant code snippets. Previous approaches typically rely on pretraining models using crafted bi-modal and uni-modal datasets to align text and code representations. In this paper, we introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community, offering naturally structured mixed-modal QA pairs. To validate its effectiveness, we propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models. Compared to previous models that primarily employ bimodal and unimodal pairs extracted from CodeSearchNet for pre-training, our model exhibits significant performance improvements across a wide range of code retrieval benchmarks.
comment: Accepted to LREC-COLING 2024
☆ ToXCL: A Unified Framework for Toxic Speech Detection and Explanation NAACL 2024
The proliferation of online toxic speech is a pertinent problem posing threats to demographic groups. While explicit toxic speech contains offensive lexical signals, implicit one consists of coded or indirect language. Therefore, it is crucial for models not only to detect implicit toxic speech but also to explain its toxicity. This draws a unique need for unified frameworks that can effectively detect and explain implicit toxic speech. Prior works mainly formulated the task of toxic speech detection and explanation as a text generation problem. Nonetheless, models trained using this strategy can be prone to suffer from the consequent error propagation problem. Moreover, our experiments reveal that the detection results of such models are much lower than those that focus only on the detection task. To bridge these gaps, we introduce ToXCL, a unified framework for the detection and explanation of implicit toxic speech. Our model consists of three modules: a (i) Target Group Generator to generate the targeted demographic group(s) of a given post; an (ii) Encoder-Decoder Model in which the encoder focuses on detecting implicit toxic speech and is boosted by a (iii) Teacher Classifier via knowledge distillation, and the decoder generates the necessary explanation. ToXCL achieves new state-of-the-art effectiveness, and outperforms baselines significantly.
comment: Accepted at NAACL 2024 (Main Conference)
☆ Who is bragging more online? A large scale analysis of bragging in social media LREC
Bragging is the act of uttering statements that are likely to be positively viewed by others and it is extensively employed in human communication with the aim to build a positive self-image of oneself. Social media is a natural platform for users to employ bragging in order to gain admiration, respect, attention and followers from their audiences. Yet, little is known about the scale of bragging online and its characteristics. This paper employs computational sociolinguistics methods to conduct the first large scale study of bragging behavior on Twitter (U.S.) by focusing on its overall prevalence, temporal dynamics and impact of demographic factors. Our study shows that the prevalence of bragging decreases over time within the same population of users. In addition, younger, more educated and popular users in the U.S. are more likely to brag. Finally, we conduct an extensive linguistics analysis to unveil specific bragging themes associated with different user traits.
comment: Accepted at LREC-COLING 2024
☆ RU22Fact: Optimizing Evidence for Multilingual Explainable Fact-Checking on Russia-Ukraine Conflict
Fact-checking is the task of verifying the factuality of a given claim by examining the available evidence. High-quality evidence plays a vital role in enhancing fact-checking systems and facilitating the generation of explanations that are understandable to humans. However, the provision of both sufficient and relevant evidence for explainable fact-checking systems poses a challenge. To tackle this challenge, we propose a method based on a Large Language Model to automatically retrieve and summarize evidence from the Web. Furthermore, we construct RU22Fact, a novel multilingual explainable fact-checking dataset on the Russia-Ukraine conflict in 2022 of 16K samples, each containing real-world claims, optimized evidence, and referenced explanation. To establish a baseline for our dataset, we also develop an end-to-end explainable fact-checking system to verify claims and generate explanations. Experimental results demonstrate the prospect of optimized evidence in increasing fact-checking performance and also indicate the possibility of further progress in the end-to-end claim verification and explanation generation tasks.
comment: 12 pages, 3 figures, accepted by lrec-coling2024
☆ Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT
Text continues to remain a relevant form of representation for information. Text documents are created either in digital native platforms or through the conversion of other media files such as images and speech. While the digital native text is invariably obtained through physical or virtual keyboards, technologies such as OCR and speech recognition are utilized to transform the images and speech signals into text content. All these variety of mechanisms of text generation also introduce errors into the captured text. This project aims at analyzing different kinds of error that occurs in text documents. The work employs two of the advanced deep neural network-based language models, namely, BART and MarianMT, to rectify the anomalies present in the text. Transfer learning of these models with available dataset is performed to finetune their capacity for error correction. A comparative study is conducted to investigate the effectiveness of these models in handling each of the defined error categories. It is observed that while both models can bring down the erroneous sentences by 20+%, BART can handle spelling errors far better (24.6%) than grammatical errors (8.8%).
☆ A comparative analysis of embedding models for patent similarity
This paper makes two contributions to the field of text-based patent similarity. First, it compares the performance of different kinds of patent-specific pretrained embedding models, namely static word embeddings (such as word2vec and doc2vec models) and contextual word embeddings (such as transformers based models), on the task of patent similarity calculation. Second, it compares specifically the performance of Sentence Transformers (SBERT) architectures with different training phases on the patent similarity task. To assess the models' performance, we use information about patent interferences, a phenomenon in which two or more patent claims belonging to different patent applications are proven to be overlapping by patent examiners. Therefore, we use these interferences cases as a proxy for maximum similarity between two patents, treating them as ground-truth to evaluate the performance of the different embedding models. Our results point out that, first, Patent SBERT-adapt-ub, the domain adaptation of the pretrained Sentence Transformer architecture proposed in this research, outperforms the current state-of-the-art in patent similarity. Second, they show that, in some cases, large static models performances are still comparable to contextual ones when trained on extensive data; thus, we believe that the superiority in the performance of contextual embeddings may not be related to the actual architecture but rather to the way the training phase is performed.
☆ Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts SC
Tasks such as semantic search and clustering on crisis-related social media texts enhance our comprehension of crisis discourse, aiding decision-making and targeted interventions. Pre-trained language models have advanced performance in crisis informatics, but their contextual embeddings lack semantic meaningfulness. Although the CrisisTransformers family includes a sentence encoder to address the semanticity issue, it remains monolingual, processing only English texts. Furthermore, employing separate models for different languages leads to embeddings in distinct vector spaces, introducing challenges when comparing semantic similarities between multi-lingual texts. Therefore, we propose multi-lingual sentence encoders (CT-XLMR-SE and CT-mBERT-SE) that embed crisis-related social media texts for over 50 languages, such that texts with similar meanings are in close proximity within the same vector space, irrespective of language diversity. Results in sentence encoding and sentence matching tasks are promising, suggesting these models could serve as robust baselines when embedding multi-lingual crisis-related social media texts. The models are publicly available at: https://huggingface.co/crisistransformers.
comment: Accepted to ISCRAM 2024
☆ Conversational Grounding: Annotation and Analysis of Grounding Acts and Grounding Units
Successful conversations often rest on common understanding, where all parties are on the same page about the information being shared. This process, known as conversational grounding, is crucial for building trustworthy dialog systems that can accurately keep track of and recall the shared information. The proficiencies of an agent in grounding the conveyed information significantly contribute to building a reliable dialog system. Despite recent advancements in dialog systems, there exists a noticeable deficit in their grounding capabilities. Traum provided a framework for conversational grounding introducing Grounding Acts and Grounding Units, but substantial progress, especially in the realm of Large Language Models, remains lacking. To bridge this gap, we present the annotation of two dialog corpora employing Grounding Acts, Grounding Units, and a measure of their degree of grounding. We discuss our key findings during the annotation and also provide a baseline model to test the performance of current Language Models in categorizing the grounding acts of the dialogs. Our work aims to provide a useful resource for further research in making conversations with machines better understood and more reliable in natural day-to-day collaborative dialogs.
☆ TrustAI at SemEval-2024 Task 8: A Comprehensive Analysis of Multi-domain Machine Generated Text Detection Techniques
The Large Language Models (LLMs) exhibit remarkable ability to generate fluent content across a wide spectrum of user queries. However, this capability has raised concerns regarding misinformation and personal information leakage. In this paper, we present our methods for the SemEval2024 Task8, aiming to detect machine-generated text across various domains in both mono-lingual and multi-lingual contexts. Our study comprehensively analyzes various methods to detect machine-generated text, including statistical, neural, and pre-trained model approaches. We also detail our experimental setup and perform a in-depth error analysis to evaluate the effectiveness of these methods. Our methods obtain an accuracy of 86.9\% on the test set of subtask-A mono and 83.7\% for subtask-B. Furthermore, we also highlight the challenges and essential factors for consideration in future studies.
comment: 8 pages, 1 Figure
☆ Can Large Language Models (or Humans) Distill Text?
We investigate the potential of large language models (LLMs) to distill text: to remove the textual traces of an undesired forbidden variable. We employ a range of LLMs with varying architectures and training approaches to distill text by identifying and removing information about the target variable while preserving other relevant signals. Our findings shed light on the strengths and limitations of LLMs in addressing the distillation and provide insights into the strategies for leveraging these models in computational social science investigations involving text data. In particular, we show that in the strong test of removing sentiment, the statistical association between the processed text and sentiment is still clearly detectable to machine learning classifiers post-LLM-distillation. Furthermore, we find that human annotators also struggle to distill sentiment while preserving other semantic content. This suggests there may be limited separability between concept variables in some text contexts, highlighting limitations of methods relying on text-level transformations and also raising questions about the robustness of distillation methods that achieve statistical independence in representation space if this is difficult for human coders operating on raw text to attain.
☆ NSINA: A News Corpus for Sinhala LREC
The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSINA, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSINA aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSINA is the largest news corpus for Sinhala, available up to date.
comment: Accepted to LREC-COLING 2024 (The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)
☆ PE: A Poincare Explanation Method for Fast Text Hierarchy Generation
The black-box nature of deep learning models in NLP hinders their widespread application. The research focus has shifted to Hierarchical Attribution (HA) for its ability to model feature interactions. Recent works model non-contiguous combinations with a time-costly greedy search in Eculidean spaces, neglecting underlying linguistic information in feature representations. In this work, we introduce a novel method, namely Poincar\'e Explanation (PE), for modeling feature interactions using hyperbolic spaces in an $O(n^2logn)$ time complexity. Inspired by Poincar\'e model, we propose a framework to project the embeddings into hyperbolic spaces, which exhibit better inductive biases for syntax and semantic hierarchical structures. Eventually, we prove that the hierarchical clustering process in the projected space could be viewed as building a minimum spanning tree and propose a time efficient algorithm. Experimental results demonstrate the effectiveness of our approach.
comment: 12 pages, 10 figures
☆ Efficient Information Extraction in Few-Shot Relation Classification through Contrastive Representation Learning NAACL 2024
Differentiating relationships between entity pairs with limited labeled instances poses a significant challenge in few-shot relation classification. Representations of textual data extract rich information spanning the domain, entities, and relations. In this paper, we introduce a novel approach to enhance information extraction combining multiple sentence representations and contrastive learning. While representations in relation classification are commonly extracted using entity marker tokens, we argue that substantial information within the internal model representations remains untapped. To address this, we propose aligning multiple sentence representations, such as the [CLS] token, the [MASK] token used in prompting, and entity marker tokens. Our method employs contrastive learning to extract complementary discriminative information from these individual representations. This is particularly relevant in low-resource settings where information is scarce. Leveraging multiple sentence representations is especially effective in distilling discriminative information for relation classification when additional information, like relation descriptions, are not available. We validate the adaptability of our approach, maintaining robust performance in scenarios that include relation descriptions, and showcasing its flexibility to adapt to different resource constraints.
comment: NAACL 2024
☆ Hallucination Detection in Foundation Models for Decision-Making: A Flexible Definition and Review of the State of the Art
Autonomous systems are soon to be ubiquitous, from manufacturing autonomy to agricultural field robots, and from health care assistants to the entertainment industry. The majority of these systems are developed with modular sub-components for decision-making, planning, and control that may be hand-engineered or learning-based. While these existing approaches have been shown to perform well under the situations they were specifically designed for, they can perform especially poorly in rare, out-of-distribution scenarios that will undoubtedly arise at test-time. The rise of foundation models trained on multiple tasks with impressively large datasets from a variety of fields has led researchers to believe that these models may provide common sense reasoning that existing planners are missing. Researchers posit that this common sense reasoning will bridge the gap between algorithm development and deployment to out-of-distribution tasks, like how humans adapt to unexpected scenarios. Large language models have already penetrated the robotics and autonomous systems domains as researchers are scrambling to showcase their potential use cases in deployment. While this application direction is very promising empirically, foundation models are known to hallucinate and generate decisions that may sound reasonable, but are in fact poor. We argue there is a need to step back and simultaneously design systems that can quantify the certainty of a model's decision, and detect when it may be hallucinating. In this work, we discuss the current use cases of foundation models for decision-making tasks, provide a general definition for hallucinations with examples, discuss existing approaches to hallucination detection and mitigation with a focus on decision problems, and explore areas for further research in this exciting field.
comment: 31 pages, 2 tables
☆ Visually Guided Generative Text-Layout Pre-training for Document Intelligence NAACL 2024
Prior study shows that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to gain abilities to perceive and reason both document texts and layouts (e.g., locations of texts and table-cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, facilitating ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts of document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
comment: Accepted to NAACL 2024 main conference. The first version of this paper was submitted to OpenReview (https://openreview.net/forum?id=ARtBIBAmNR) in June 2023
☆ LLMs Are Few-Shot In-Context Low-Resource Language Learners
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages. Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages.
☆ LARA: Linguistic-Adaptive Retrieval-Augmented LLMs for Multi-Turn Intent Classification
Following the significant achievements of large language models (LLMs), researchers have employed in-context learning for text classification tasks. However, these studies focused on monolingual, single-turn classification tasks. In this paper, we introduce LARA (Linguistic-Adaptive Retrieval-Augmented Language Models), designed to enhance accuracy in multi-turn classification tasks across six languages, accommodating numerous intents in chatbot interactions. Multi-turn intent classification is notably challenging due to the complexity and evolving nature of conversational contexts. LARA tackles these issues by combining a fine-tuned smaller model with a retrieval-augmented mechanism, integrated within the architecture of LLMs. This integration allows LARA to dynamically utilize past dialogues and relevant intents, thereby improving the understanding of the context. Furthermore, our adaptive retrieval techniques bolster the cross-lingual capabilities of LLMs without extensive retraining and fine-tune. Comprehensive experiments demonstrate that LARA achieves state-of-the-art performance on multi-turn intent classification tasks, enhancing the average accuracy by 3.67% compared to existing methods.
☆ Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks LREC
Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Geoparsing must deal with the ambiguity of the expressions that indicate multiple locations with the same notation. For evaluating geoparsing systems, several corpora have been proposed in previous work. However, these corpora are small-scale and suffer from the coverage of location expressions on general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method to construct a large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions. 45.6% of location expressions are ambiguous and refer to more than one location with the same notation. In each article, location expressions of the article title and those hyperlinks to other articles are assigned with coordinates. By utilizing hyperlinks, we can accurately assign location expressions with coordinates even with ambiguous location expressions in the texts. Experimental results show that there remains room for improvement by disambiguating location expressions.
comment: LREC-COLING 2024
☆ Few-shot Named Entity Recognition via Superposition Concept Discrimination LREC
Few-shot NER aims to identify entities of target types with only limited number of illustrative instances. Unfortunately, few-shot NER is severely challenged by the intrinsic precise generalization problem, i.e., it is hard to accurately determine the desired target type due to the ambiguity stemming from information deficiency. In this paper, we propose Superposition Concept Discriminator (SuperCD), which resolves the above challenge via an active learning paradigm. Specifically, a concept extractor is first introduced to identify superposition concepts from illustrative instances, with each concept corresponding to a possible generalization boundary. Then a superposition instance retriever is applied to retrieve corresponding instances of these superposition concepts from large-scale text corpus. Finally, annotators are asked to annotate the retrieved instances and these annotated instances together with original illustrative instances are used to learn FS-NER models. To this end, we learn a universal concept extractor and superposition instance retriever using a large-scale openly available knowledge bases. Experiments show that SuperCD can effectively identify superposition concepts from illustrative instances, retrieve superposition instances from large-scale corpus, and significantly improve the few-shot NER performance with minimal additional efforts.
comment: Accepted to LREC-COLING 2024
☆ A Study on How Attention Scores in the BERT Model are Aware of Lexical Categories in Syntactic and Semantic Tasks on the GLUE Benchmark
This study examines whether the attention scores between tokens in the BERT model significantly vary based on lexical categories during the fine-tuning process for downstream tasks. Drawing inspiration from the notion that in human language processing, syntactic and semantic information is parsed differently, we categorize tokens in sentences according to their lexical categories and focus on changes in attention scores among these categories. Our hypothesis posits that in downstream tasks that prioritize semantic information, attention scores centered on content words are enhanced, while in cases emphasizing syntactic information, attention scores centered on function words are intensified. Through experimentation conducted on six tasks from the GLUE benchmark dataset, we substantiate our hypothesis regarding the fine-tuning process. Furthermore, our additional investigations reveal the presence of BERT layers that consistently assign more bias to specific lexical categories, irrespective of the task, highlighting the existence of task-agnostic lexical category preferences.
☆ Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm
Large language models (LLMs) are gaining increasing interests to improve clinical efficiency for medical diagnosis, owing to their unprecedented performance in modelling natural language. Ensuring the safe and reliable clinical applications, the evaluation of LLMs indeed becomes critical for better mitigating the potential risks, e.g., hallucinations. However, current evaluation methods heavily rely on labor-intensive human participation to achieve human-preferred judgements. To overcome this challenge, we propose an automatic evaluation paradigm tailored to assess the LLMs' capabilities in delivering clinical services, e.g., disease diagnosis and treatment. The evaluation paradigm contains three basic elements: metric, data, and algorithm. Specifically, inspired by professional clinical practice pathways, we formulate a LLM-specific clinical pathway (LCP) to define the clinical capabilities that a doctor agent should possess. Then, Standardized Patients (SPs) from the medical education are introduced as the guideline for collecting medical data for evaluation, which can well ensure the completeness of the evaluation procedure. Leveraging these steps, we develop a multi-agent framework to simulate the interactive environment between SPs and a doctor agent, which is equipped with a Retrieval-Augmented Evaluation (RAE) to determine whether the behaviors of a doctor agent are in accordance with LCP. The above paradigm can be extended to any similar clinical scenarios to automatically evaluate the LLMs' medical capabilities. Applying such paradigm, we construct an evaluation benchmark in the field of urology, including a LCP, a SPs dataset, and an automated RAE. Extensive experiments are conducted to demonstrate the effectiveness of the proposed approach, providing more insights for LLMs' safe and reliable deployments in clinical practice.
☆ KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models
Instruction Tuning on Large Language Models is an essential process for model to function well and achieve high performance in specific tasks. Accordingly, in mainstream languages such as English, instruction-based datasets are being constructed and made publicly available. In the case of Korean, publicly available models and datasets all rely on using the output of ChatGPT or translating datasets built in English. In this paper, We introduce \textit{KIT-19} as an instruction dataset for the development of LLM in Korean. \textit{KIT-19} is a dataset created in an instruction format, comprising 19 existing open-source datasets for Korean NLP tasks. In this paper, we train a Korean Pretrained LLM using \textit{KIT-19} to demonstrate its effectiveness. The experimental results show that the model trained on \textit{KIT-19} significantly outperforms existing Korean LLMs. Based on the its quality and empirical results, this paper proposes that \textit{KIT-19} has the potential to make a substantial contribution to the future improvement of Korean LLMs' performance.
☆ CodeS: Natural Language to Code Repository via Multi-Layer Sketch
The impressive performance of large language models (LLMs) on code-related tasks has shown the potential of fully automated software development. In light of this, we introduce a new software engineering task, namely Natural Language to code Repository (NL2Repo). This task aims to generate an entire code repository from its natural language requirements. To address this task, we propose a simple yet effective framework CodeS, which decomposes NL2Repo into multiple sub-tasks by a multi-layer sketch. Specifically, CodeS includes three modules: RepoSketcher, FileSketcher, and SketchFiller. RepoSketcher first generates a repository's directory structure for given requirements; FileSketcher then generates a file sketch for each file in the generated structure; SketchFiller finally fills in the details for each function in the generated file sketch. To rigorously assess CodeS on the NL2Repo task, we carry out evaluations through both automated benchmarking and manual feedback analysis. For benchmark-based evaluation, we craft a repository-oriented benchmark, SketchEval, and design an evaluation metric, SketchBLEU. For feedback-based evaluation, we develop a VSCode plugin for CodeS and engage 30 participants in conducting empirical studies. Extensive experiments prove the effectiveness and practicality of CodeS on the NL2Repo task.
comment: https://github.com/NL2Code/CodeS
☆ If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize important textual features for VLMs. EX2 uses reinforcement learning to align a large language model with VLM preferences and generates descriptions that incorporate the important features for the VLM. Then, we inspect the descriptions to identify the features that contribute to VLM representations. We find that spurious descriptions have a major role in VLM representations despite providing no helpful information, e.g., Click to enlarge photo of CONCEPT. More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
comment: Code: https://github.com/BatsResearch/ex2
☆ Evaluating Large Language Models with Runtime Behavior of Program Execution
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). Evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs.
☆ InstUPR : Instruction-based Unsupervised Passage Reranking with Large Language Models
This paper introduces InstUPR, an unsupervised passage reranking method based on large language models (LLMs). Different from existing approaches that rely on extensive training with query-document pairs or retrieval-specific instructions, our method leverages the instruction-following capabilities of instruction-tuned LLMs for passage reranking without any additional fine-tuning. To achieve this, we introduce a soft score aggregation technique and employ pairwise reranking for unsupervised passage reranking. Experiments on the BEIR benchmark demonstrate that InstUPR outperforms unsupervised baselines as well as an instruction-tuned reranker, highlighting its effectiveness and superiority. Source code to reproduce all experiments is open-sourced at https://github.com/MiuLab/InstUPR
comment: Preprint. This manuscript was originally written and submitted in June 2023
☆ $\textit{LinkPrompt}$: Natural and Universal Adversarial Attacks on Prompt-based Language Models NAACL2024
Prompt-based learning is a new language model training paradigm that adapts the Pre-trained Language Models (PLMs) to downstream tasks, which revitalizes the performance benchmarks across various natural language processing (NLP) tasks. Instead of using a fixed prompt template to fine-tune the model, some research demonstrates the effectiveness of searching for the prompt via optimization. Such prompt optimization process of prompt-based learning on PLMs also gives insight into generating adversarial prompts to mislead the model, raising concerns about the adversarial vulnerability of this paradigm. Recent studies have shown that universal adversarial triggers (UATs) can be generated to alter not only the predictions of the target PLMs but also the prediction of corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based learning paradigm. However, UATs found in previous works are often unreadable tokens or characters and can be easily distinguished from natural texts with adaptive defenses. In this work, we consider the naturalness of the UATs and develop $\textit{LinkPrompt}$, an adversarial attack algorithm to generate UATs by a gradient-based beam search algorithm that not only effectively attacks the target PLMs and PFMs but also maintains the naturalness among the trigger tokens. Extensive results demonstrate the effectiveness of $\textit{LinkPrompt}$, as well as the transferability of UATs generated by \textit{LinkPrompt} to open-sourced Large Language Model (LLM) Llama2 and API-accessed LLM GPT-3.5-turbo.
comment: Accepted to the main conference of NAACL2024
☆ Is There a One-Model-Fits-All Approach to Information Extraction? Revisiting Task Definition Biases
Definition bias is a negative phenomenon that can mislead models. Definition bias in information extraction appears not only across datasets from different domains but also within datasets sharing the same domain. We identify two types of definition bias in IE: bias among information extraction datasets and bias between information extraction datasets and instruction tuning datasets. To systematically investigate definition bias, we conduct three probing experiments to quantitatively analyze it and discover the limitations of unified information extraction and large language models in solving definition bias. To mitigate definition bias in information extraction, we propose a multi-stage framework consisting of definition bias measurement, bias-aware fine-tuning, and task-specific bias mitigation. Experimental results demonstrate the effectiveness of our framework in addressing definition bias. Resources of this paper can be found at https://github.com/EZ-hwh/definition-bias
comment: 15 pages, 4 figures
☆ Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation
The literature on text-to-image generation is plagued by issues of faithfully composing entities with relations. But there lacks a formal understanding of how entity-relation compositions can be effectively learned. Moreover, the underlying phenomenon space that meaningfully reflects the problem structure is not well-defined, leading to an arms race for larger quantities of data in the hope that generalization emerges out of large-scale pretraining. We hypothesize that the underlying phenomenological coverage has not been proportionally scaled up, leading to a skew of the presented phenomenon which harms generalization. We introduce statistical metrics that quantify both the linguistic and visual skew of a dataset for relational learning, and show that generalization failures of text-to-image generation are a direct result of incomplete or unbalanced phenomenological coverage. We first perform experiments in a synthetic domain and demonstrate that systematically controlled metrics are strongly predictive of generalization performance. Then we move to natural images and show that simple distribution perturbations in light of our theories boost generalization without enlarging the absolute data size. This work informs an important direction towards quality-enhancing the data diversity or balance orthogonal to scaling up the absolute size. Our discussions point out important open questions on 1) Evaluation of generated entity-relation compositions, and 2) Better models for reasoning with abstract relations.
☆ Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA CVPR 2024
Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions, current chart visual question answering (chart VQA) models suffer on complex reasoning questions. In this work, we address the lack of reasoning ability by data augmentation. We leverage Large Language Models (LLMs), which have shown to have strong reasoning ability, as an automatic data annotator that generates question-answer annotations for chart images. The key innovation in our method lies in the Synthesize Step-by-Step strategy: our LLM-based data generator learns to decompose the complex question into step-by-step sub-questions (rationales), which are then used to derive the final answer using external tools, i.e. Python. This step-wise generation procedure is trained on synthetic data generated using a template-based QA generation pipeline. Experimental results highlight the significance of the proposed step-by-step generation. By training with the LLM-augmented data (LAMENDA), we significantly enhance the chart VQA models, achieving the state-of-the-art accuracy on the ChartQA and PlotQA datasets. In particular, our approach improves the accuracy of the previous state-of-the-art approach from 38% to 54% on the human-written questions in the ChartQA dataset, which needs strong reasoning. We hope our work underscores the potential of synthetic data and encourages further exploration of data augmentation using LLMs for reasoning-heavy tasks.
comment: Accepted to CVPR 2024
☆ Enhanced Facet Generation with LLM Editing LREC
In information retrieval, facet identification of a user query is an important task. If a search service can recognize the facets of a user's query, it has the potential to offer users a much broader range of search results. Previous studies can enhance facet prediction by leveraging retrieved documents and related queries obtained through a search engine. However, there are challenges in extending it to other applications when a search engine operates as part of the model. First, search engines are constantly updated. Therefore, additional information may change during training and test, which may reduce performance. The second challenge is that public search engines cannot search for internal documents. Therefore, a separate search system needs to be built to incorporate documents from private domains within the company. We propose two strategies that focus on a framework that can predict facets by taking only queries as input without a search engine. The first strategy is multi-task learning to predict SERP. By leveraging SERP as a target instead of a source, the proposed model deeply understands queries without relying on external modules. The second strategy is to enhance the facets by combining Large Language Model (LLM) and the small model. Overall performance improves when small model and LLM are combined rather than facet generation individually.
comment: Accepted at LREC-COLING 2024
☆ A Hybrid Approach To Aspect Based Sentiment Analysis Using Transfer Learning
Aspect-Based Sentiment Analysis (ABSA) aims to identify terms or multiword expressions (MWEs) on which sentiments are expressed and the sentiment polarities associated with them. The development of supervised models has been at the forefront of research in this area. However, training these models requires the availability of manually annotated datasets which is both expensive and time-consuming. Furthermore, the available annotated datasets are tailored to a specific domain, language, and text type. In this work, we address this notable challenge in current state-of-the-art ABSA research. We propose a hybrid approach for Aspect Based Sentiment Analysis using transfer learning. The approach focuses on generating weakly-supervised annotations by exploiting the strengths of both large language models (LLM) and traditional syntactic dependencies. We utilise syntactic dependency structures of sentences to complement the annotations generated by LLMs, as they may overlook domain-specific aspect terms. Extensive experimentation on multiple datasets is performed to demonstrate the efficacy of our hybrid method for the tasks of aspect term extraction and aspect sentiment classification. Keywords: Aspect Based Sentiment Analysis, Syntactic Parsing, large language model (LLM)
☆ TwoStep: Multi-agent Task Planning using Classical Planners and Large Language Models
Classical planning formulations like the Planning Domain Definition Language (PDDL) admit action sequences guaranteed to achieve a goal state given an initial state if any are possible. However, reasoning problems defined in PDDL do not capture temporal aspects of action taking, for example that two agents in the domain can execute an action simultaneously if postconditions of each do not interfere with preconditions of the other. A human expert can decompose a goal into largely independent constituent parts and assign each agent to one of these subgoals to take advantage of simultaneous actions for faster execution of plan steps, each using only single agent planning. By contrast, large language models (LLMs) used for directly inferring plan steps do not guarantee execution success, but do leverage commonsense reasoning to assemble action sequences. We combine the strengths of classical planning and LLMs by approximating human intuitions for two-agent planning goal decomposition. We demonstrate that LLM-based goal decomposition leads to faster planning times than solving multi-agent PDDL problems directly while simultaneously achieving fewer plan execution steps than a single agent plan alone and preserving execution success. Additionally, we find that LLM-based approximations of subgoals can achieve similar multi-agent execution steps than those specified by human experts. Website and resources at https://glamor-usc.github.io/twostep
comment: 12 pages
☆ SPLICE: A Singleton-Enhanced PipeLIne for Coreference REsolution LREC
Singleton mentions, i.e.~entities mentioned only once in a text, are important to how humans understand discourse from a theoretical perspective. However previous attempts to incorporate their detection in end-to-end neural coreference resolution for English have been hampered by the lack of singleton mention spans in the OntoNotes benchmark. This paper addresses this limitation by combining predicted mentions from existing nested NER systems and features derived from OntoNotes syntax trees. With this approach, we create a near approximation of the OntoNotes dataset with all singleton mentions, achieving ~94% recall on a sample of gold singletons. We then propose a two-step neural mention and coreference resolution system, named SPLICE, and compare its performance to the end-to-end approach in two scenarios: the OntoNotes test set and the out-of-domain (OOD) OntoGUM corpus. Results indicate that reconstructed singleton training yields results comparable to end-to-end systems for OntoNotes, while improving OOD stability (+1.1 avg. F1). We conduct error analysis for mention detection and delve into its impact on coreference clustering, revealing that precision improvements deliver more substantial benefits than increases in recall for resolving coreference chains.
comment: Accepted to LREC-COLING 2024
☆ The Role of $n$-gram Smoothing in the Age of Neural Networks
For nearly three decades, language models derived from the $n$-gram assumption held the state of the art on the task. The key to their success lay in the application of various smoothing techniques that served to combat overfitting. However, when neural language models toppled $n$-gram models as the best performers, $n$-gram smoothing techniques became less relevant. Indeed, it would hardly be an understatement to suggest that the line of inquiry into $n$-gram smoothing techniques became dormant. This paper re-opens the role classical $n$-gram smoothing techniques may play in the age of neural language models. First, we draw a formal equivalence between label smoothing, a popular regularization technique for neural language models, and add-$\lambda$ smoothing. Second, we derive a generalized framework for converting \emph{any} $n$-gram smoothing technique into a regularizer compatible with neural language models. Our empirical results find that our novel regularizers are comparable to and, indeed, sometimes outperform label smoothing on language modeling and machine translation.
☆ Making Sentence Embeddings Robust to User-Generated Content LREC
NLP models have been known to perform poorly on user-generated content (UGC), mainly because it presents a lot of lexical variations and deviates from the standard texts on which most of these models were trained. In this work, we focus on the robustness of LASER, a sentence embedding model, to UGC data. We evaluate this robustness by LASER's ability to represent non-standard sentences and their standard counterparts close to each other in the embedding space. Inspired by previous works extending LASER to other languages and modalities, we propose RoLASER, a robust English encoder trained using a teacher-student approach to reduce the distances between the representations of standard and UGC sentences. We show that with training only on standard and synthetic UGC-like data, RoLASER significantly improves LASER's robustness to both natural and artificial UGC data by achieving up to 2x and 11x better scores. We also perform a fine-grained analysis on artificial UGC data and find that our model greatly outperforms LASER on its most challenging UGC phenomena such as keyboard typos and social media abbreviations. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
comment: Accepted at LREC-COLING 2024
☆ Ontology Completion with Natural Language Inference and Concept Embeddings: An Analysis
We consider the problem of finding plausible knowledge that is missing from a given ontology, as a generalisation of the well-studied taxonomy expansion task. One line of work treats this task as a Natural Language Inference (NLI) problem, thus relying on the knowledge captured by language models to identify the missing knowledge. Another line of work uses concept embeddings to identify what different concepts have in common, taking inspiration from cognitive models for category based induction. These two approaches are intuitively complementary, but their effectiveness has not yet been compared. In this paper, we introduce a benchmark for evaluating ontology completion methods and thoroughly analyse the strengths and weaknesses of both approaches. We find that both approaches are indeed complementary, with hybrid strategies achieving the best overall results. We also find that the task is highly challenging for Large Language Models, even after fine-tuning.
☆ Extracting Social Support and Social Isolation Information from Clinical Psychiatry Notes: Comparing a Rule-based NLP System and a Large Language Model
Background: Social support (SS) and social isolation (SI) are social determinants of health (SDOH) associated with psychiatric outcomes. In electronic health records (EHRs), individual-level SS/SI is typically documented as narrative clinical notes rather than structured coded data. Natural language processing (NLP) algorithms can automate the otherwise labor-intensive process of data extraction. Data and Methods: Psychiatric encounter notes from Mount Sinai Health System (MSHS, n=300) and Weill Cornell Medicine (WCM, n=225) were annotated and established a gold standard corpus. A rule-based system (RBS) involving lexicons and a large language model (LLM) using FLAN-T5-XL were developed to identify mentions of SS and SI and their subcategories (e.g., social network, instrumental support, and loneliness). Results: For extracting SS/SI, the RBS obtained higher macro-averaged f-scores than the LLM at both MSHS (0.89 vs. 0.65) and WCM (0.85 vs. 0.82). For extracting subcategories, the RBS also outperformed the LLM at both MSHS (0.90 vs. 0.62) and WCM (0.82 vs. 0.81). Discussion and Conclusion: Unexpectedly, the RBS outperformed the LLMs across all metrics. Intensive review demonstrates that this finding is due to the divergent approach taken by the RBS and LLM. The RBS were designed and refined to follow the same specific rules as the gold standard annotations. Conversely, the LLM were more inclusive with categorization and conformed to common English-language understanding. Both approaches offer advantages and are made available open-source for future testing.
comment: 2 figures, 3 tables
GPT-4 Understands Discourse at Least as Well as Humans Do
We test whether a leading AI system GPT-4 understands discourse as well as humans do, using a standardized test of discourse comprehension. Participants are presented with brief stories and then answer eight yes/no questions probing their comprehension of the story. The questions are formatted to assess the separate impacts of directness (stated vs. implied) and salience (main idea vs. details). GPT-4 performs slightly, but not statistically significantly, better than humans given the very high level of human performance. Both GPT-4 and humans exhibit a strong ability to make inferences about information that is not explicitly stated in a story, a critical test of understanding.
comment: 7 pages, 1 figure, 3 tables
☆ NUMTEMP: A real-world benchmark to verify claims with statistical and temporal expressions
Automated fact checking has gained immense interest to tackle the growing misinformation in the digital era. Existing systems primarily focus on synthetic claims on Wikipedia, and noteworthy progress has also been made on real-world claims. In this work, we release Numtemp, a diverse, multi-domain dataset focused exclusively on numerical claims, encompassing temporal, statistical and diverse aspects with fine-grained metadata and an evidence collection without leakage. This addresses the challenge of verifying real-world numerical claims, which are complex and often lack precise information, not addressed by existing works that mainly focus on synthetic claims. We evaluate and quantify the limitations of existing solutions for the task of verifying numerical claims. We also evaluate claim decomposition based methods, numerical understanding based models and our best baselines achieves a macro-F1 of 58.32. This demonstrates that Numtemp serves as a challenging evaluation set for numerical claim verification.
comment: 17 pages, 1 figure
☆ Reflecting the Male Gaze: Quantifying Female Objectification in 19th and 20th Century Novels LREC
Inspired by the concept of the male gaze (Mulvey, 1975) in literature and media studies, this paper proposes a framework for analyzing gender bias in terms of female objectification: the extent to which a text portrays female individuals as objects of visual pleasure. Our framework measures female objectification along two axes. First, we compute an agency bias score that indicates whether male entities are more likely to appear in the text as grammatical agents than female entities. Next, by analyzing the word embedding space induced by a text (Caliskan et al., 2017), we compute an appearance bias score that indicates whether female entities are more closely associated with appearance-related words than male entities. Applying our framework to 19th and 20th century novels reveals evidence of female objectification in literature: we find that novels written from a male perspective systematically objectify female characters, while novels written from a female perspective do not exhibit statistically significant objectification of any gender.
comment: To appear in LREC-COLING 2024
☆ Task-Agnostic Detector for Insertion-Based Backdoor Attacks NAACL 2024
Textual backdoor attacks pose significant security threats. Current detection approaches, typically relying on intermediate feature representation or reconstructing potential triggers, are task-specific and less effective beyond sentence classification, struggling with tasks like question answering and named entity recognition. We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection. TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks. TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods.
comment: Findings of NAACL 2024
☆ Outcome-Constrained Large Language Models for Countering Hate Speech
Counterspeech that challenges or responds to hate speech has been seen as an alternative to mitigate the negative impact of hate speech and foster productive online communications. Research endeavors have been directed to using language models for the automatic generation of counterspeech to assist efforts in combating online hate. Existing research focuses on the generation of counterspeech with certain linguistic attributes, such as being polite, informative, and intent-driven. However, it remains unclear what impact the counterspeech might have in an online environment. We first explore methods that utilize large language models (LLM) to generate counterspeech constrained by potential conversation outcomes. We build two conversation outcome classifiers that predict the incivility level and the hater reentry behavior following replies to hate with Reddit data, then propose four methods to incorporate the desired outcomes, i.e., low conversation incivility and non-hateful hater reentry, into the text generation process, including Prompt with Instructions, Prompt and Select, LLM finetune, and LLM transformer reinforcement learning (TRL). Evaluation results show effective strategies to generate outcome-constrained counterspeech and the linguistic characteristics of texts generated by different methods.
☆ Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language LREC
Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is a growing interest in the community to build datasets capable of training machine learning models to extract relationships. However, annotating such datasets can be expensive and time-consuming, in addition to being limited to English. This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset. We also create a manually annotated dataset with 2000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we experiment with multilingual and cross-lingual experiments that could benefit many low-resource languages.
comment: Accepted to LREC-COLING 2024 (The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)
☆ MetaAligner: Conditional Weak-to-Strong Correction for Generalizable Multi-Objective Alignment of Language Models
Recent advancements in large language models (LLMs) aim to tackle heterogeneous human expectations and values via multi-objective preference alignment. However, existing methods are parameter-adherent to the policy model, leading to two key limitations: (1) the high-cost repetition of their alignment algorithms for each new target model; (2) they cannot expand to unseen objectives due to their static alignment objectives. In this work, we propose Meta-Objective Aligner (MetaAligner), a model that performs conditional weak-to-strong correction for weak responses to approach strong responses. MetaAligner is the first policy-agnostic and generalizable method for multi-objective preference alignment, which enables plug-and-play alignment by decoupling parameter updates from the policy models and facilitates zero-shot preference alignment for unseen objectives via in-context learning. Experimental results show that MetaAligner achieves significant and balanced improvements in multi-objective alignments on 11 policy models with up to 63x more parameters, and outperforms previous alignment methods with down to 22.27x less computational resources. The model also accurately aligns with unseen objectives, marking the first step towards generalizable multi-objective preference alignment.
comment: Work in progress, more general experimental results to come
☆ Exploring the Generalization of Cancer Clinical Trial Eligibility Classifiers Across Diseases
Clinical trials are pivotal in medical research, and NLP can enhance their success, with application in recruitment. This study aims to evaluate the generalizability of eligibility classification across a broad spectrum of clinical trials. Starting with phase 3 cancer trials, annotated with seven eligibility exclusions, then to determine how well models can generalize to non-cancer and non-phase 3 trials. To assess this, we have compiled eligibility criteria data for five types of trials: (1) additional phase 3 cancer trials, (2) phase 1 and 2 cancer trials, (3) heart disease trials, (4) type 2 diabetes trials, and (5) observational trials for any disease, comprising 2,490 annotated eligibility criteria across seven exclusion types. Our results show that models trained on the extensive cancer dataset can effectively handle criteria commonly found in non-cancer trials, such as autoimmune diseases. However, they struggle with criteria disproportionately prevalent in cancer trials, like prior malignancy. We also experiment with few-shot learning, demonstrating that a limited number of disease-specific examples can partially overcome this performance gap. We are releasing this new dataset of annotated eligibility statements to promote the development of cross-disease generalization in clinical trial classification.
☆ The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLM) without updating the models' parameters, in contrast to the traditional gradient-based finetuning. The promise of ICL is that the LLM can adapt to perform the present task at a competitive or state-of-the-art level at a fraction of the cost. The ability of LLMs to perform tasks in this few-shot manner relies on their background knowledge of the task (or task priors). However, recent work has found that, unlike traditional learning, LLMs are unable to fully integrate information from demonstrations that contrast task priors. This can lead to performance saturation at suboptimal levels, especially for subjective tasks such as emotion recognition, where the mapping from text to emotions can differ widely due to variability in human annotations. In this work, we design experiments and propose measurements to explicitly quantify the consistency of proxies of LLM priors and their pull on the posteriors. We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions. We also find that the larger the model, the stronger these effects become. Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain and when interpreting ICL results.
comment: 30 pages, 27 figures
☆ Grounding Language Plans in Demonstrations Through Counterfactual Perturbations
Grounding the common-sense reasoning of Large Language Models in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior works have focused on leveraging LLMs directly for planning in symbolic spaces, this work uses LLMs to guide the search of task structures and constraints implicit in multi-step demonstrations. Specifically, we borrow from manipulation planning literature the concept of mode families, which group robot configurations by specific motion constraints, to serve as an abstraction layer between the high-level language representations of an LLM and the low-level physical trajectories of a robot. By replaying a few human demonstrations with synthetic perturbations, we generate coverage over the demonstrations' state space with additional successful executions as well as counterfactuals that fail the task. Our explanation-based learning framework trains an end-to-end differentiable neural network to predict successful trajectories from failures and as a by-product learns classifiers that ground low-level states and images in mode families without dense labeling. The learned grounding classifiers can further be used to translate language plans into reactive policies in the physical domain in an interpretable manner. We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks. Website: https://sites.google.com/view/grounding-plans
☆ Attribute First, then Generate: Locally-attributable Grounded Text Generation
Recent efforts to address hallucinations in Large Language Models (LLMs) have focused on attributed text generation, which supplements generated texts with citations of supporting sources for post-generation fact-checking and corrections. Yet, these citations often point to entire documents or paragraphs, burdening users with extensive verification work. In this paper, we introduce a locally-attributable text generation approach, prioritizing concise attributions. Our method, named ``Attribute First, then Generate'', breaks down the conventional end-to-end generation process into three intuitive steps: content selection, sentence planning, and sequential sentence generation. By initially identifying relevant source segments (``select first'') and then conditioning the generation process on them (``then generate''), we ensure these segments also act as the output's fine-grained attributions (``select'' becomes ``attribute''). Tested on Multi-document Summarization and Long-form Question-answering, our method not only yields more concise citations than the baselines but also maintains - and in some cases enhances - both generation quality and attribution accuracy. Furthermore, it significantly reduces the time required for fact verification by human assessors.
♻ ☆ LOCOST: State-Space Models for Long Document Abstractive Summarization EACL 2024
State-space models are a low-complexity alternative to transformers for encoding long sequences and capturing long-term dependencies. We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs. With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns. We evaluate our model on a series of long document abstractive summarization tasks. The model reaches a performance level that is 93-96% comparable to the top-performing sparse transformers of the same size while saving up to 50% memory during training and up to 87% during inference. Additionally, LOCOST effectively handles input texts exceeding 600K tokens at inference time, setting new state-of-the-art results on full-book summarization and opening new perspectives for long input processing.
comment: 9 pages, 5 figures, 7 tables, EACL 2024 conference
♻ ☆ Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That!
While Transformer-based neural machine translation (NMT) is very effective in high-resource settings, many languages lack the necessary large parallel corpora to benefit from it. In the context of low-resource (LR) MT between two closely-related languages, a natural intuition is to seek benefits from structural "shortcuts", such as copying subwords from the source to the target, given that such language pairs often share a considerable number of identical words, cognates, and borrowings. We test Pointer-Generator Networks for this purpose for six language pairs over a variety of resource ranges, and find weak improvements for most settings. However, analysis shows that the model does not show greater improvements for closely-related vs. more distant language pairs, or for lower resource ranges, and that the models do not exhibit the expected usage of the mechanism for shared subwords. Our discussion of the reasons for this behaviour highlights several general challenges for LR NMT, such as modern tokenization strategies, noisy real-world conditions, and linguistic complexities. We call for better scrutiny of linguistically motivated improvements to NMT given the blackbox nature of Transformer models, as well as for a focus on the above problems in the field.
comment: 4 pages
♻ ☆ A Second Look on BASS -- Boosting Abstractive Summarization with Unified Semantic Graphs -- A Replication Study ECIR 2024
We present a detailed replication study of the BASS framework, an abstractive summarization system based on the notion of Unified Semantic Graphs. Our investigation includes challenges in replicating key components and an ablation study to systematically isolate error sources rooted in replicating novel components. Our findings reveal discrepancies in performance compared to the original work. We highlight the significance of paying careful attention even to reasonably omitted details for replicating advanced frameworks like BASS, and emphasize key practices for writing replicable papers.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Advances in Information Retrieval, 46th European Conference on Information Retrieval, ECIR 2024. 16 pages, 4 figures
♻ ☆ LongHeads: Multi-Head Attention is Secretly a Long Context Processor
Large language models (LLMs) have achieved impressive performance in numerous domains but often struggle to process lengthy inputs effectively and efficiently due to limited length generalization and attention's quadratic computational demands. Many sought to mitigate this by restricting the attention window within the pre-trained length. However, these methods introduce new issues such as ignoring the middle context and requiring additional training. To address these problems, we propose LongHeads, a training-free framework that enhances LLM's long context ability by unlocking multi-head attention's untapped potential. Instead of allowing each head to attend to the full sentence, which struggles with generalizing to longer sequences due to out-of-distribution (OOD) issues, we allow each head to process in-distribution length by selecting and attending to important context chunks. To this end, we propose a chunk selection strategy that relies on the inherent correlation between the query and the key representations, efficiently distributing context chunks to different heads. In this way, each head ensures it can effectively process attended tokens within the trained length, while different heads in different layers can collectively process longer contexts. LongHeads works efficiently in linear time, fits seamlessly with many LLMs that use relative positional encoding. LongHeads achieves 100% accuracy at the 128k length on passkey retrieval task, verifying LongHeads's efficacy in extending the usable context window for existing models. We release our code at https://github.com/LuLuLuyi/LongHeads .
♻ ☆ On the Relationship between Skill Neurons and Robustness in Prompt Tuning
Prompt Tuning is a popular parameter-efficient finetuning method for pre-trained large language models (PLMs). Based on experiments with RoBERTa, it has been suggested that Prompt Tuning activates specific neurons in the transformer's feed-forward networks, that are highly predictive and selective for the given task. In this paper, we study the robustness of Prompt Tuning in relation to these "skill neurons", using RoBERTa and T5. We show that prompts tuned for a specific task are transferable to tasks of the same type but are not very robust to adversarial data. While prompts tuned for RoBERTa yield below-chance performance on adversarial data, prompts tuned for T5 are slightly more robust and retain above-chance performance in two out of three cases. At the same time, we replicate the finding that skill neurons exist in RoBERTa and further show that skill neurons also exist in T5. Interestingly, the skill neurons of T5 determined on non-adversarial data are also among the most predictive neurons on the adversarial data, which is not the case for RoBERTa. We conclude that higher adversarial robustness may be related to a model's ability to consistently activate the relevant skill neurons on adversarial data.
♻ ☆ Chitchat as Interference: Adding User Backstories to Task-Oriented Dialogues LREC
During task-oriented dialogues (TODs), human users naturally introduce chitchat that is beyond the immediate scope of the task, interfering with the flow of the conversation. To address this issue without the need for expensive manual data creation, we use few-shot prompting with Llama-2-70B to enhance the MultiWOZ dataset with user backstories, a typical example of chitchat interference in TODs. We assess the impact of this addition by testing two models: one trained solely on TODs and another trained on TODs with a preliminary chitchat interaction. Our analysis demonstrates that our enhanced dataset poses a challenge for these systems. Moreover, we demonstrate that our dataset can be effectively used for training purposes, enabling a system to consistently acknowledge the user's backstory while also successfully moving the task forward in the same turn, as confirmed by human evaluation. These findings highlight the benefits of generating novel chitchat-TOD scenarios to test TOD systems more thoroughly and improve their resilience to natural user interferences
comment: Accepted @ LREC-COLING 2024
♻ ☆ MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models NAACL 2024
Generative language models are usually pretrained on large text corpus via predicting the next token (i.e., sub-word/word/phrase) given the previous ones. Recent works have demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge in text corpus during training, i.e., the imbalance between frequent tokens and infrequent ones. It can lead a language model to be dominated by common and easy-to-learn tokens, thereby overlooking the infrequent and difficult-to-learn ones. To alleviate that, we propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens. During training, it can dynamically assess the learning difficulty of a to-be-learned token, according to the information entropy of the corresponding predicted probability distribution over the vocabulary. Then it scales the training loss adaptively, trying to lead the model to focus more on the difficult-to-learn tokens. On the Pile dataset, we train generative language models at different scales of 468M, 1.2B, and 6.7B parameters. Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
comment: This paper has been accepted by NAACL 2024
♻ ☆ Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation LREC
The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation. Knowledge Distillation (KD) enhances efficiency by transferring knowledge from a teacher model to a more compact student model. However, KD approaches to Transformer architecture often rely on heuristics, particularly when deciding which teacher layers to distill from. In this paper, we introduce the 'Align-to-Distill' (A2D) strategy, designed to address the feature mapping problem by adaptively aligning student attention heads with their teacher counterparts during training. The Attention Alignment Module in A2D performs a dense head-by-head comparison between student and teacher attention heads across layers, turning the combinatorial mapping heuristics into a learning problem. Our experiments show the efficacy of A2D, demonstrating gains of up to +3.61 and +0.63 BLEU points for WMT-2022 De->Dsb and WMT-2014 En->De, respectively, compared to Transformer baselines.
comment: Accepted to LREC-COLING 2024
♻ ☆ To share or not to share: What risks would laypeople accept to give sensitive data to differentially-private NLP systems? LREC
Although the NLP community has adopted central differential privacy as a go-to framework for privacy-preserving model training or data sharing, the choice and interpretation of the key parameter, privacy budget $\varepsilon$ that governs the strength of privacy protection, remains largely arbitrary. We argue that determining the $\varepsilon$ value should not be solely in the hands of researchers or system developers, but must also take into account the actual people who share their potentially sensitive data. In other words: Would you share your instant messages for $\varepsilon$ of 10? We address this research gap by designing, implementing, and conducting a behavioral experiment (311 lay participants) to study the behavior of people in uncertain decision-making situations with respect to privacy-threatening situations. Framing the risk perception in terms of two realistic NLP scenarios and using a vignette behavioral study help us determine what $\varepsilon$ thresholds would lead lay people to be willing to share sensitive textual data - to our knowledge, the first study of its kind.
comment: Accepted at LREC-COLING 2024; final camera-ready version
♻ ☆ HealthFC: Verifying Health Claims with Evidence-Based Medical Fact-Checking LREC
In the digital age, seeking health advice on the Internet has become a common practice. At the same time, determining the trustworthiness of online medical content is increasingly challenging. Fact-checking has emerged as an approach to assess the veracity of factual claims using evidence from credible knowledge sources. To help advance automated Natural Language Processing (NLP) solutions for this task, in this paper we introduce a novel dataset HealthFC. It consists of 750 health-related claims in German and English, labeled for veracity by medical experts and backed with evidence from systematic reviews and clinical trials. We provide an analysis of the dataset, highlighting its characteristics and challenges. The dataset can be used for NLP tasks related to automated fact-checking, such as evidence retrieval, claim verification, or explanation generation. For testing purposes, we provide baseline systems based on different approaches, examine their performance, and discuss the findings. We show that the dataset is a challenging test bed with a high potential for future use.
comment: Accepted to LREC-COLING 2024
♻ ☆ With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation
Long text generation, such as novel writing and discourse-level translation with extremely long contexts, presents significant challenges to current language models. Existing methods mainly focus on extending the model's context window through strategies like length extrapolation. However, these approaches demand substantial hardware resources during the training and/or inference phases. Our proposed method, Temp-Lora, introduces an alternative concept. Instead of relying on the KV cache to store all context information, we embeds this information directly into a temporary Lora module. In the process of long text generation, this module is progressively trained with text generated previously. This approach not only efficiently preserves contextual knowledge but also prevents any permanent alteration to the model's parameters given that the module is discarded post-generation. Extensive experiments on the PG19 language modeling benchmark and the GuoFeng discourse-level translation benchmark validate the effectiveness of Temp-Lora. Our results show that: 1) Temp-Lora substantially enhances generation quality for long text, as indicated by a 13.2% decrease in perplexity (PPL) on a subset of PG19, and a 29.3% decrease in PPL along with a 113.2% increase in BLEU score on a subset of GuoFeng, 2) Temp-Lora is compatible with and enhances most existing long text generation methods, and 3) Temp-Lora can greatly reduce computational costs by shortening the context window. For example, we can ensure a moderate improvement in generation quality (a decrease of 3.8% in PPL) while enabling a 51.5% memory usage reduction and a 60.0% decrease in latency for inference.
♻ ☆ Dial-MAE: ConTextual Masked Auto-Encoder for Retrieval-based Dialogue Systems NAACL 2024
Dialogue response selection aims to select an appropriate response from several candidates based on a given user and system utterance history. Most existing works primarily focus on post-training and fine-tuning tailored for cross-encoders. However, there are no post-training methods tailored for dense encoders in dialogue response selection. We argue that when the current language model, based on dense dialogue systems (such as BERT), is employed as a dense encoder, it separately encodes dialogue context and response, leading to a struggle to achieve the alignment of both representations. Thus, we propose Dial-MAE (Dialogue Contextual Masking Auto-Encoder), a straightforward yet effective post-training technique tailored for dense encoders in dialogue response selection. Dial-MAE uses an asymmetric encoder-decoder architecture to compress the dialogue semantics into dense vectors, which achieves better alignment between the features of the dialogue context and response. Our experiments have demonstrated that Dial-MAE is highly effective, achieving state-of-the-art performance on two commonly evaluated benchmarks.
comment: This paper has been accepted by NAACL 2024
♻ ☆ Effective Distillation of Table-based Reasoning Ability from LLMs
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their enormous parameter size and extremely high requirements for compute power pose challenges for their practical deployment. Recent research has revealed that specific capabilities of LLMs, such as numerical reasoning, can be transferred to smaller models through distillation. Some studies explore the potential of leveraging LLMs to perform table-based reasoning. However, there has been no prior work focusing on table reasoning skills in smaller models specifically tailored for scientific table-to-text generation tasks. In this paper, we propose a novel table-based reasoning distillation approach, with the aim of distilling LLMs into tailored smaller models. Our experimental results have shown that a 220 million parameter model (Flan-T5-base) fine-tuned using distilled data, not only achieves a significant improvement compared to traditionally fine-tuned baselines, but also surpasses specific LLMs on a scientific table-to-text generation dataset. Our code is available at https://github.com/Bernard-Yang/DistillTableCoT.
♻ ☆ HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models CVPR 2024
We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.
comment: Accepted to CVPR 2024
♻ ☆ A Survey of Confidence Estimation and Calibration in Large Language Models
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains. Despite their impressive performance, they can be unreliable due to factual errors in their generations. Assessing their confidence and calibrating them across different tasks can help mitigate risks and enable LLMs to produce better generations. There has been a lot of recent research aiming to address this, but there has been no comprehensive overview to organize it and outline the main lessons learned. The present survey aims to bridge this gap. In particular, we outline the challenges and we summarize recent technical advancements for LLM confidence estimation and calibration. We further discuss their applications and suggest promising directions for future work.
comment: 16 pages, 1 page, 1 table
♻ ☆ Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models
Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the training and inference phases, restricting their use to a limited audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among various aspects: visual representation, language models, and optimization strategies. We show that without increasing the volume of training data, our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs. Our code is available at https://github.com/zhuyiche/llava-phi.
♻ ☆ Exploring ChatGPT and its Impact on Society
Artificial intelligence has been around for a while, but suddenly it has received more attention than ever before. Thanks to innovations from companies like Google, Microsoft, Meta, and other major brands in technology. OpenAI, though, has triggered the button with its ground-breaking invention ChatGPT. ChatGPT is a Large Language Model (LLM) based on Transformer architecture that has the ability to generate human-like responses in a conversational context. It uses deep learning algorithms to generate natural language responses to input text. Its large number of parameters, contextual generation, and open-domain training make it a versatile and effective tool for a wide range of applications, from chatbots to customer service to language translation. It has the potential to revolutionize various industries and transform the way we interact with technology. However, the use of ChatGPT has also raised several concerns, including ethical, social, and employment challenges, which must be carefully considered to ensure the responsible use of this technology. The article provides an overview of ChatGPT, delving into its architecture and training process. It highlights the potential impacts of ChatGPT on the society. In this paper, we suggest some approaches involving technology, regulation, education, and ethics in an effort to maximize ChatGPT's benefits while minimizing its negative impacts. This study is expected to contribute to a greater understanding of ChatGPT and aid in predicting the potential changes it may bring about.
comment: 13 Pages
♻ ☆ SEA: Sparse Linear Attention with Estimated Attention Mask
The transformer architecture has driven breakthroughs in recent years on tasks which require modeling pairwise relationships between sequential elements, as is the case in natural language understanding. However, long seqeuences pose a problem due to the quadratic complexity of the attention operation. Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix. Yet, these approaches cannot straightforwardly distill knowledge from a teacher's attention matrix and often require complete retraining from scratch. Furthermore, previous sparse and linear approaches lose interpretability if they cannot produce full attention matrices. To address these challenges, we propose SEA: Sparse linear attention with an Estimated Attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then subsequently creates a sparse attention matrix with a top-k selection to perform a sparse attention operation. For language modeling tasks (Wikitext2), previous linear and sparse attention methods show roughly two-fold worse perplexity scores over the quadratic OPT-1.3B baseline, while SEA achieves better perplexity than OPT-1.3B, using roughly half the memory of OPT-1.3B, providing interpretable attention matrix. We believe that our work will have a large practical impact, as it opens the possibility of running large transformers on resource-limited devices with less memory.
comment: 9 main pages
♻ ☆ Situated Natural Language Explanations
Natural language is among the most accessible tools for explaining decisions to humans, and large pretrained language models (PLMs) have demonstrated impressive abilities to generate coherent natural language explanations (NLE). The existing NLE research perspectives do not take the audience into account. An NLE can have high textual quality, but it might not accommodate audiences' needs and preference. To address this limitation, we propose an alternative perspective, \textit{situated} NLE. On the evaluation side, we set up automated evaluation scores. These scores describe the properties of NLEs in lexical, semantic, and pragmatic categories. On the generation side, we identify three prompt engineering techniques and assess their applicability on the situations. Situated NLE provides a perspective and facilitates further research on the generation and evaluation of explanations.
♻ ☆ A Transfer Attack to Image Watermarks
Watermark has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detector against evasion attacks in the white-box and black-box settings is well understood in the literature. However, the robustness in the no-box setting is much less understood. In particular, multiple studies claimed that image watermark is robust in such setting. In this work, we propose a new transfer evasion attack to image watermark in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show that, both theoretically and empirically, watermark-based AI-generated image detector is not robust to evasion attacks even if the attacker does not have access to the watermarking model nor the detection API.
♻ ☆ A Survey on Large Language Model based Autonomous Agents
Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at https://github.com/Paitesanshi/LLM-Agent-Survey.
comment: 35 pages, 5 figures, 3 tables, has been accepted by frontiers of computer science (FCS), doi={10.1007/s11704-024-40231-1}
♻ ☆ OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models NAACL 2024
Object navigation (ObjectNav) requires an agent to navigate through unseen environments to find queried objects. Many previous methods attempted to solve this task by relying on supervised or reinforcement learning, where they are trained on limited household datasets with close-set objects. However, two key challenges are unsolved: understanding free-form natural language instructions that demand open-set objects, and generalizing to new environments in a zero-shot manner. Aiming to solve the two challenges, in this paper, we propose OpenFMNav, an Open-set Foundation Model based framework for zero-shot object Navigation. We first unleash the reasoning abilities of large language models (LLMs) to extract proposed objects from natural language instructions that meet the user's demand. We then leverage the generalizability of large vision language models (VLMs) to actively discover and detect candidate objects from the scene, building a Versatile Semantic Score Map (VSSM). Then, by conducting common sense reasoning on VSSM, our method can perform effective language-guided exploration and exploitation of the scene and finally reach the goal. By leveraging the reasoning and generalizing abilities of foundation models, our method can understand free-form human instructions and perform effective open-set zero-shot navigation in diverse environments. Extensive experiments on the HM3D ObjectNav benchmark show that our method surpasses all the strong baselines on all metrics, proving our method's effectiveness. Furthermore, we perform real robot demonstrations to validate our method's open-set-ness and generalizability to real-world environments.
comment: NAACL 2024 Findings
♻ ☆ Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation NAACL 2024
Synthetic users are cost-effective proxies for real users in the evaluation of conversational recommender systems. Large language models show promise in simulating human-like behavior, raising the question of their ability to represent a diverse population of users. We introduce a new protocol to measure the degree to which language models can accurately emulate human behavior in conversational recommendation. This protocol is comprised of five tasks, each designed to evaluate a key property that a synthetic user should exhibit: choosing which items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Through evaluation of baseline simulators, we demonstrate these tasks effectively reveal deviations of language models from human behavior, and offer insights on how to reduce the deviations with model selection and prompting strategies.
comment: NAACL 2024
♻ ☆ OffLanDat: A Community Based Implicit Offensive Language Dataset Generated by Large Language Model Through Prompt Engineering
The widespread presence of offensive languages on social media has resulted in adverse effects on societal well-being. As a result, it has become very important to address this issue with high priority. Offensive languages exist in both explicit and implicit forms, with the latter being more challenging to detect. Current research in this domain encounters several challenges. Firstly, the existing datasets primarily rely on the collection of texts containing explicit offensive keywords, making it challenging to capture implicitly offensive contents that are devoid of these keywords. Secondly, usual methodologies tend to focus solely on textual analysis, neglecting the valuable insights that community information can provide. In this research paper, we introduce a novel dataset OffLanDat, a community based implicit offensive language dataset generated by ChatGPT containing data for 38 different target groups. Despite limitations in generating offensive texts using ChatGPT due to ethical constraints, we present a prompt-based approach that effectively generates implicit offensive languages. To ensure data quality, we evaluate our data with human. Additionally, we employ a prompt-based Zero-Shot method with ChatGPT and compare the detection results between human annotation and ChatGPT annotation. We utilize existing state-of-the-art models to see how effective they are in detecting such languages. We will make our code and dataset public for other researchers.
♻ ☆ Enhancing End-to-End Multi-Task Dialogue Systems: A Study on Intrinsic Motivation Reinforcement Learning Algorithms for Improved Training and Adaptability
End-to-end multi-task dialogue systems are usually designed with separate modules for the dialogue pipeline. Among these, the policy module is essential for deciding what to do in response to user input. This policy is trained by reinforcement learning algorithms by taking advantage of an environment in which an agent receives feedback in the form of a reward signal. The current dialogue systems, however, only provide meagre and simplistic rewards. Investigating intrinsic motivation reinforcement learning algorithms is the goal of this study. Through this, the agent can quickly accelerate training and improve its capacity to judge the quality of its actions by teaching it an internal incentive system. In particular, we adapt techniques for random network distillation and curiosity-driven reinforcement learning to measure the frequency of state visits and encourage exploration by using semantic similarity between utterances. Experimental results on MultiWOZ, a heterogeneous dataset, show that intrinsic motivation-based debate systems outperform policies that depend on extrinsic incentives. By adopting random network distillation, for example, which is trained using semantic similarity between user-system dialogues, an astounding average success rate of 73% is achieved. This is a significant improvement over the baseline Proximal Policy Optimization (PPO), which has an average success rate of 60%. In addition, performance indicators such as booking rates and completion rates show a 10% rise over the baseline. Furthermore, these intrinsic incentive models help improve the system's policy's resilience in an increasing amount of domains. This implies that they could be useful in scaling up to settings that cover a wider range of domains.
comment: 6 pages, 1 figure, 18th IEEE International Conference on Semantic Computing
♻ ☆ MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods
Recent research in decoding methods for Natural Language Generation (NLG) tasks has shown that MAP decoding is not optimal, because model probabilities do not always align with human preferences. Stronger decoding methods, including Quality Estimation (QE) reranking and Minimum Bayes' Risk (MBR) decoding, have since been proposed to mitigate the model-perplexity-vs-quality mismatch. While these decoding methods achieve state-of-the-art performance, they are prohibitively expensive to compute. In this work, we propose MBR finetuning and QE finetuning which distill the quality gains from these decoding methods at training time, while using an efficient decoding algorithm at inference time. Using the canonical NLG task of Neural Machine Translation (NMT), we show that even with self-training, these finetuning methods significantly outperform the base model. Moreover, when using an external LLM as a teacher model, these finetuning methods outperform finetuning on human-generated references. These findings suggest new ways to leverage monolingual data to achieve improvements in model quality that are on par with, or even exceed, improvements from human-curated data, while maintaining maximum efficiency during decoding.
♻ ☆ Language Models (Mostly) Do Not Consider Emotion Triggers When Predicting Emotion NAACL 2024
Situations and events evoke emotions in humans, but to what extent do they inform the prediction of emotion detection models? This work investigates how well human-annotated emotion triggers correlate with features that models deemed salient in their prediction of emotions. First, we introduce a novel dataset EmoTrigger, consisting of 900 social media posts sourced from three different datasets; these were annotated by experts for emotion triggers with high agreement. Using EmoTrigger, we evaluate the ability of large language models (LLMs) to identify emotion triggers, and conduct a comparative analysis of the features considered important for these tasks between LLMs and fine-tuned models. Our analysis reveals that emotion triggers are largely not considered salient features for emotion prediction models, instead there is intricate interplay between various features and the task of emotion detection.
comment: NAACL 2024 Camera Ready
♻ ☆ Learning Transfers over Several Programming Languages
Large language models (LLMs) have become remarkably good at improving developer productivity for high-resource programming languages. These models use two kinds of data: large amounts of unlabeled code samples for pre-training and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. Unfortunately, many programming languages are low-resource, lacking labeled samples for most tasks and often even lacking unlabeled samples. Therefore, users of low-resource languages (e.g., legacy or new languages) miss out on the benefits of LLMs. Cross-lingual transfer uses data from a source language to improve model performance on a target language. It has been well-studied for natural languages, but has received little attention for programming languages. This paper reports extensive experiments on four tasks using a transformer-based LLM and 11 to 41 programming languages to explore the following questions. First, how well does cross-lingual transfer work for a given task across different language pairs. Second, given a task and target language, how should one choose a source language. Third, which characteristics of a language pair are predictive of transfer performance, and how does that depend on the given task. Our empirical study with 1,808 experiments reveals practical and scientific insights, such as Kotlin and JavaScript being the most transferable source languages and different tasks relying on substantially different features. Overall, we find that learning transfers well across several programming languages.
comment: 15 pages, 9 figures, 8 tables
♻ ☆ Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark
Text-to-SQL, which involves translating natural language into Structured Query Language (SQL), is crucial for enabling broad access to structured databases without expert knowledge. However, designing models for such tasks is challenging due to numerous factors, including the presence of 'noise,' such as ambiguous questions and syntactical errors. This study provides an in-depth analysis of the distribution and types of noise in the widely used BIRD-Bench benchmark and the impact of noise on models. While BIRD-Bench was created to model dirty and noisy database values, it was not created to contain noise and errors in the questions and gold queries. We found that noise in questions and gold queries are prevalent in the dataset, with varying amounts across domains, and with an uneven distribution between noise types. The presence of incorrect gold SQL queries, which then generate incorrect gold answers, has a significant impact on the benchmark's reliability. Surprisingly, when evaluating models on corrected SQL queries, zero-shot baselines surpassed the performance of state-of-the-art prompting methods. We conclude that informative noise labels and reliable benchmarks are crucial to developing new Text-to-SQL methods that can handle varying types of noise. All datasets, annotations, and code are available at https://github.com/niklaswretblad/the-effects-of-noise-in-text-to-SQL.
♻ ☆ Othering and low status framing of immigrant cuisines in US restaurant reviews and large language models
Identifying implicit attitudes toward food can mitigate social prejudice due to food's salience as a marker of ethnic identity. Stereotypes about food are representational harms that may contribute to racialized discourse and negatively impact economic outcomes for restaurants. Understanding the presence of representational harms in online corpora in particular is important, given the increasing use of large language models (LLMs) for text generation and their tendency to reproduce attitudes in their training data. Through careful linguistic analyses, we evaluate social theories about attitudes toward immigrant cuisine in a large-scale study of framing differences in 2.1M English language Yelp reviews. Controlling for factors such as restaurant price and neighborhood racial diversity, we find that immigrant cuisines are more likely to be othered using socially constructed frames of authenticity (e.g., "authentic," "traditional"), and that non-European cuisines (e.g., Indian, Mexican) in particular are described as more exotic compared to European ones (e.g., French). We also find that non-European cuisines are more likely to be described as cheap and dirty, even after controlling for price, and even among the most expensive restaurants. Finally, we show that reviews generated by LLMs reproduce similar framing tendencies, pointing to the downstream retention of these representational harms. Our results corroborate social theories of gastronomic stereotyping, revealing racialized evaluative processes and linguistic strategies through which they manifest.
comment: ICWSM '24
♻ ☆ Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification AAAI
Domain-specific knowledge can significantly contribute to addressing a wide variety of vision tasks. However, the generation of such knowledge entails considerable human labor and time costs. This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information through semantic embeddings. To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors in the context of the Vision-based Zero-shot Object State Classification task. We thoroughly examine the behavior of the LLM through an extensive ablation study. Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements. Drawing insights from this ablation study, we conduct a comparative analysis against competing models, thereby highlighting the state-of-the-art performance achieved by the proposed approach.
comment: Accepted at the AAAI-MAKE 24
♻ ☆ Visual Grounding Helps Learn Word Meanings in Low-Data Regimes NAACL 2024
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to achieve these results, LMs must be trained in distinctly un-human-like ways - requiring orders of magnitude more language data than children receive during development, and without perceptual or social context. Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition. We train a diverse set of LM architectures, with and without auxiliary visual supervision, on datasets of varying scales. We then evaluate these models' learning of syntactic categories, lexical relations, semantic features, word similarity, and alignment with human neural representations. We find that visual supervision can indeed improve the efficiency of word learning. However, these improvements are limited: they are present almost exclusively in the low-data regime, and sometimes canceled out by the inclusion of rich distributional signals from text. The information conveyed by text and images is not redundant -- models mainly driven by visual information yield qualitatively different from those mainly driven by word co-occurrences. However, our results suggest that current multimodal modeling approaches fail to effectively leverage visual information to build human-like word representations from human-scale data.
comment: Accepted by NAACL 2024
♻ ☆ Generative Pre-training for Speech with Flow Matching ICLR 2024
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.
comment: ICLR 2024
Software Engineering
☆ Looking back and forward: A retrospective and future directions on Software Engineering for systems-of-systems
Modern systems are increasingly connected and more integrated with other existing systems, giving rise to systems-of-systems (SoS). An SoS consists of a set of independent, heterogeneous systems that interact to provide new functionalities and accomplish global missions through emergent behavior manifested at runtime. The distinctive characteristics of SoS, when contrasted to traditional systems, pose significant research challenges within Software Engineering. These challenges motivate the need for a paradigm shift and the exploration of novel approaches for designing, developing, deploying, and evolving these systems. The International Workshop on Software Engineering for Systems-of-Systems (SESoS) series started in 2013 to fill a gap in scientific forums addressing SoS from the Software Engineering perspective, becoming the first venue for this purpose. This article presents a study aimed at outlining the evolution and future trajectory of Software Engineering for SoS based on the examination of 57 papers spanning the 11 editions of the SESoS workshop (2013-2023). The study combined scoping review and scientometric analysis methods to categorize and analyze the research contributions concerning temporal and geographic distribution, topics of interest, research methodologies employed, application domains, and research impact. Based on such a comprehensive overview, this article discusses current and future directions in Software Engineering for SoS.
☆ Design Patterns for Multilevel Modeling and Simulation
Multilevel modeling and simulation (M&S) is becoming increasingly relevant due to the benefits that this methodology offers. Multilevel models allow users to describe a system at multiple levels of detail. From one side, this can make better use of computational resources, since the more detailed and time-consuming models can be executed only when/where required. From the other side, multilevel models can be assembled from existing components, cutting down development and verification/validation time. A downside of multilevel M&S is that the development process becomes more complex due to some recurrent issues caused by the very nature of multilevel models: how to make sub-models interoperate, how to orchestrate execution, how state variables are to be updated when changing scale, and so on. In this paper, we address some of these issues by presenting a set of design patterns that provide a systematic approach for designing and implementing multilevel models. The proposed design patterns cover multiple aspects, including how to represent different levels of detail, how to combine incompatible models, how to exchange data across models, and so on. Some of the patterns are derived from the general software engineering literature, while others are specific to the multilevel M&S application area.
☆ ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search LREC
Retrieval-based code question answering seeks to match user queries in natural language to relevant code snippets. Previous approaches typically rely on pretraining models using crafted bi-modal and uni-modal datasets to align text and code representations. In this paper, we introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community, offering naturally structured mixed-modal QA pairs. To validate its effectiveness, we propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models. Compared to previous models that primarily employ bimodal and unimodal pairs extracted from CodeSearchNet for pre-training, our model exhibits significant performance improvements across a wide range of code retrieval benchmarks.
comment: Accepted to LREC-COLING 2024
☆ Investigating the Readability of Test Code: Combining Scientific and Practical Views
The readability of source code is key for understanding and maintaining software systems and tests. Several studies investigate the readability of source code, but there is limited research on the readability of test code and related influence factors. We investigate the factors that influence the readability of test code from an academic perspective complemented by practical views. First, we perform a Systematic Mapping Study (SMS) with a focus on scientific literature. Second, we extend this study by reviewing grey literature sources for practical aspects on test code readability and understandability. Finally, we conduct a controlled experiment on the readability of a selected set of test cases to collect additional knowledge on influence factors discussed in practice. The result set of the SMS includes 19 primary studies from the scientific literature. The grey literature search reveals 62 sources for information on test code readability. Based on an analysis of these sources, we identified a combined set of 14 factors that influence the readability of test code. 7 of these factors were found in scientific and grey literature, while some factors were mainly discussed in academia (2) or industry (5) with limited overlap. The controlled experiment on practically relevant influence factors showed that the investigated factors have a significant impact on readability for half of the selected test cases. Our review of scientific and grey literature showed that test code readability is of interest for academia and industry with a consensus on key influence factors. However, we also found factors only discussed by practitioners. For some of these factors we were able to confirm an impact on readability in a first experiment. Therefore, we see the need to bring together academic and industry viewpoints to achieve a common view on the readability of software test code.
☆ Exposing the hidden layers and interplay in the quantum software stack
Current and near-future quantum computers face resource limitations due to noise and low qubit counts. Despite this, effective quantum advantage can still be achieved due to the exponential nature of bit-to-qubit conversion. However, optimizing the software architecture of these systems is essential to utilize available resources efficiently. Unfortunately, the focus on user-friendly quantum computers has obscured critical steps in the software stack, leading to ripple effects into the stack's upper layer induced by limitations in current qubit implementations. This paper unveils the hidden interplay among layers of the quantum software stack.
☆ Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models ISSTA2024
Recent studies show that deployed deep learning (DL) models such as those of Tensor Flow Lite (TFLite) can be easily extracted from real-world applications and devices by attackers to generate many kinds of attacks like adversarial attacks. Although securing deployed on-device DL models has gained increasing attention, no existing methods can fully prevent the aforementioned threats. Traditional software protection techniques have been widely explored, if on-device models can be implemented using pure code, such as C++, it will open the possibility of reusing existing software protection techniques. However, due to the complexity of DL models, there is no automatic method that can translate the DL models to pure code. To fill this gap, we propose a novel method, CustomDLCoder, to automatically extract the on-device model information and synthesize a customized executable program for a wide range of DL models. CustomDLCoder first parses the DL model, extracts its backend computing units, configures the computing units to a graph, and then generates customized code to implement and deploy the ML solution without explicit model representation. The synthesized program hides model information for DL deployment environments since it does not need to retain explicit model representation, preventing many attacks on the DL model. In addition, it improves ML performance because the customized code removes model parsing and preprocessing steps and only retains the data computing process. Our experimental results show that CustomDLCoder improves model security by disabling on-device model sniffing. Compared with the original on-device platform (i.e., TFLite), our method can accelerate model inference by 21.0% and 24.3% on x86-64 and ARM64 platforms, respectively. Most importantly, it can significantly reduce memory consumption by 68.8% and 36.0% on x86-64 and ARM64 platforms, respectively.
comment: Accepted by the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA2024)
☆ CodeS: Natural Language to Code Repository via Multi-Layer Sketch
The impressive performance of large language models (LLMs) on code-related tasks has shown the potential of fully automated software development. In light of this, we introduce a new software engineering task, namely Natural Language to code Repository (NL2Repo). This task aims to generate an entire code repository from its natural language requirements. To address this task, we propose a simple yet effective framework CodeS, which decomposes NL2Repo into multiple sub-tasks by a multi-layer sketch. Specifically, CodeS includes three modules: RepoSketcher, FileSketcher, and SketchFiller. RepoSketcher first generates a repository's directory structure for given requirements; FileSketcher then generates a file sketch for each file in the generated structure; SketchFiller finally fills in the details for each function in the generated file sketch. To rigorously assess CodeS on the NL2Repo task, we carry out evaluations through both automated benchmarking and manual feedback analysis. For benchmark-based evaluation, we craft a repository-oriented benchmark, SketchEval, and design an evaluation metric, SketchBLEU. For feedback-based evaluation, we develop a VSCode plugin for CodeS and engage 30 participants in conducting empirical studies. Extensive experiments prove the effectiveness and practicality of CodeS on the NL2Repo task.
comment: https://github.com/NL2Code/CodeS
☆ Evaluating Large Language Models with Runtime Behavior of Program Execution
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). Evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs.
☆ A Mixed Method Study of DevOps Challenges
Context: DevOps practices combine software development and IT operations. There is a growing number of DevOps related posts in popular online developer forum Stack Overflow (SO). While previous research analyzed SO posts related to build/release engineering, we are aware of no research that specifically focused on DevOps related discussions. Objective: To learn the challenges developers face while using the currently available DevOps tools and techniques along with the organizational challenges in DevOps practices. Method: We conduct an empirical study by applying topic modeling on 174K SO posts that contain DevOps discussions. We then validate and extend the empirical study findings with a survey of 21 professional DevOps practitioners. Results: We find that: (1) There are 23 DevOps topics grouped into four categories: Cloud & CI/CD Tools, Infrastructure as Code, Container & Orchestration, and Quality Assurance. (2) The topic category Cloud & CI/CD Tools contains the highest number of topics (10) which cover 48.6% of all questions in our dataset, followed by the category Infrastructure as Code (28.9%). (3) File management is the most popular topic followed by Jenkins Pipeline, while infrastructural Exception Handling and Jenkins Distributed Architecture are the most difficult topics (with least accepted answers). (4) In the survey, developers mention that it requires hands-on experience before current DevOps tools can be considered easy. They raised the needs for better documentation and learning resources to learn the rapidly changing DevOps tools and techniques. Practitioners also emphasized on the formal training approach by the organizations for DevOps skill development. Conclusion: Architects and managers can use the findings of this research to adopt appropriate DevOps technologies, and organizations can design tool or process specific DevOps training programs.
☆ AgentFL: Scaling LLM-based Fault Localization to Project-Level Context
Fault Localization (FL) is an essential step during the debugging process. With the strong capabilities of code comprehension, the recent Large Language Models (LLMs) have demonstrated promising performance in diagnosing bugs in the code. Nevertheless, due to LLMs' limited performance in handling long contexts, existing LLM-based fault localization remains on localizing bugs within a small code scope (i.e., a method or a class), which struggles to diagnose bugs for a large code scope (i.e., an entire software system). To address the limitation, this paper presents AgentFL, a multi-agent system based on ChatGPT for automated fault localization. By simulating the behavior of a human developer, AgentFL models the FL task as a three-step process, which involves comprehension, navigation, and confirmation. Within each step, AgentFL hires agents with diversified expertise, each of which utilizes different tools to handle specific tasks. Particularly, we adopt a series of auxiliary strategies such as Test Behavior Tracking, Document-Guided Search, and Multi-Round Dialogue to overcome the challenges in each step. The evaluation on the widely used Defects4J-V1.2.0 benchmark shows that AgentFL can localize 157 out of 395 bugs within Top-1, which outperforms the other LLM-based approaches and exhibits complementarity to the state-of-the-art learning-based techniques. Additionally, we confirm the indispensability of the components in AgentFL with the ablation study and demonstrate the usability of AgentFL through a user study. Finally, the cost analysis shows that AgentFL spends an average of only 0.074 dollars and 97 seconds for a single bug.
☆ ChatDBG: An AI-Powered Debugging Assistant
This paper presents ChatDBG, the first AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like "why is x null?". To handle these queries, ChatDBG grants the LLM autonomy to take the wheel and drive debugging by issuing commands to navigate through stacks and inspect program state; it then reports its findings and yields back control to the programmer. Our ChatDBG prototype integrates with standard debuggers including LLDB, GDB, and WinDBG for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded nearly 30,000 times.
comment: 11 pages
☆ ChatGPT Incorrectness Detection in Software Reviews
We conducted a survey of 135 software engineering (SE) practitioners to understand how they use Generative AI-based chatbots like ChatGPT for SE tasks. We find that they want to use ChatGPT for SE tasks like software library selection but often worry about the truthfulness of ChatGPT responses. We developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) to automatically test and detect the incorrectness in ChatGPT responses. CID is based on the iterative prompting to ChatGPT by asking it contextually similar but textually divergent questions (using an approach that utilizes metamorphic relationships in texts). The underlying principle in CID is that for a given question, a response that is different from other responses (across multiple incarnations of the question) is likely an incorrect response. In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74 - 0.75.
☆ A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection
Large Language Models (LLMs) have demonstrated great potential for code generation and other software engineering tasks. Vulnerability detection is of crucial importance to maintaining the security, integrity, and trustworthiness of software systems. Precise vulnerability detection requires reasoning about the code, making it a good case study for exploring the limits of LLMs' reasoning capabilities. Although recent work has applied LLMs to vulnerability detection using generic prompting techniques, their full capabilities for this task and the types of errors they make when explaining identified vulnerabilities remain unclear. In this paper, we surveyed eleven LLMs that are state-of-the-art in code generation and commonly used as coding assistants, and evaluated their capabilities for vulnerability detection. We systematically searched for the best-performing prompts, incorporating techniques such as in-context learning and chain-of-thought, and proposed three of our own prompting methods. Our results show that while our prompting methods improved the models' performance, LLMs generally struggled with vulnerability detection. They reported 0.5-0.63 Balanced Accuracy and failed to distinguish between buggy and fixed versions of programs in 76% of cases on average. By comprehensively analyzing and categorizing 287 instances of model reasoning, we found that 57% of LLM responses contained errors, and the models frequently predicted incorrect locations of buggy code and misidentified bug types. LLMs only correctly localized 6 out of 27 bugs in DbgBench, and these 6 bugs were predicted correctly by 70-100% of human participants. These findings suggest that despite their potential for other tasks, LLMs may fail to properly comprehend critical code structures and security-related concepts. Our data and code are available at https://figshare.com/s/78fe02e56e09ec49300b.
☆ Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation
Code translation between programming languages is a long-existing and critical task in software engineering, facilitating the modernization of legacy systems, ensuring cross-platform compatibility, and enhancing software performance. With the recent advances in large language models (LLMs) and their applications to code translation, there is an increasing need for comprehensive evaluation of these models. In this study, we empirically analyze the generated outputs of eleven popular instruct-tuned LLMs with parameters ranging from 1B up to 46.7B on 3,820 translation pairs across five languages, including C, C++, Go, Java, and Python. Our analysis found that between 26.4% and 73.7% of code translations produced by our evaluated LLMs necessitate post-processing, as these translations often include a mix of code, quotes, and text rather than being purely source code. Overlooking the output format of these models can inadvertently lead to underestimation of their actual performance. This is particularly evident when evaluating them with execution-based metrics such as Computational Accuracy (CA). Our results demonstrate that a strategic combination of prompt engineering and regular expression can effectively extract the source code from the model generation output. In particular, our method can help eleven selected models achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our findings shed light on and motivate future research to conduct more reliable benchmarks of LLMs for code translation.
comment: Accepted into 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (Forge)
☆ Generation of Asset Administration Shell with Large Language Model Agents: Interoperability in Digital Twins with Semantic Node
This research introduces a novel approach for assisting the creation of Asset Administration Shell (AAS) instances for digital twin modeling within the context of Industry 4.0, aiming to enhance interoperability in smart manufacturing and reduce manual effort. We construct a "semantic node" data structure to capture the semantic essence of textual data. Then, a system powered by large language models is designed and implemented to process "semantic node" and generate AAS instance models from textual technical data. Our evaluation demonstrates a 62-79% effective generation rate, indicating a substantial proportion of manual creation effort can be converted into easier validation effort, thereby reducing the time and cost in creating AAS instance models. In our evaluation, a comparative analysis of different LLMs and an in-depth ablation study of Retrieval-Augmented Generation (RAG) mechanisms provide insights into the effectiveness of LLM systems for interpreting technical concepts. Our findings emphasize LLMs' capability in automating AAS instance creation, enhancing semantic interoperability, and contributing to the broader field of semantic interoperability for digital twins in industrial applications. The prototype implementation and evaluation results are released on our GitHub Repository with the link: https://github.com/YuchenXia/AASbyLLM
comment: Pre-print, submitted to IEEE ACCESS, under peer-review
☆ On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance
Deciding what combination of operators to use across the Edge AI tiers to achieve specific latency and model performance requirements is an open question for MLOps engineers. This study aims to empirically assess the accuracy vs inference time trade-off of different black-box Edge AI deployment strategies, i.e., combinations of deployment operators and deployment tiers. In this paper, we conduct inference experiments involving 3 deployment operators (i.e., Partitioning, Quantization, Early Exit), 3 deployment tiers (i.e., Mobile, Edge, Cloud) and their combinations on four widely used Computer-Vision models to investigate the optimal strategies from the point of view of MLOps developers. Our findings suggest that Edge deployment using the hybrid Quantization + Early Exit operator could be preferred over non-hybrid operators (Quantization/Early Exit on Edge, Partition on Mobile-Edge) when faster latency is a concern at medium accuracy loss. However, when minimizing accuracy loss is a concern, MLOps engineers should prefer using only a Quantization operator on edge at a latency reduction or increase, respectively over the Early Exit/Partition (on edge/mobile-edge) and Quantized Early Exit (on edge) operators. In scenarios constrained by Mobile CPU/RAM resources, a preference for Partitioning across mobile and edge tiers is observed over mobile deployment. For models with smaller input data samples (such as FCN), a network-constrained cloud deployment can also be a better alternative than Mobile/Edge deployment and Partitioning strategies. For models with large input data samples (ResNet, ResNext, DUC), an edge tier having higher network/computational capabilities than Cloud/Mobile can be a more viable option than Partitioning and Mobile/Cloud deployment strategies.
☆ RepairAgent: An Autonomous, LLM-Based Agent for Program Repair
Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent's effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI's GPT-3.5 model, translates to 14 cents of USD per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.
☆ Concerned with Data Contamination? Assessing Countermeasures in Code Language Model
Various techniques have been proposed to leverage the capabilities of code language models (CLMs) for SE tasks. While these techniques typically evaluate their effectiveness using publicly available datasets, the evaluation can be subject to data contamination threats where the evaluation datasets have already been used to train the concerned CLMs. This can significantly affect the reliability of the evaluation. Different countermeasures have been suggested to mitigate the data contamination threat. Countermeasures include using more recent data, curating new data, and refactoring existing data are introduced, yet it is unclear whether these countermeasures could really mitigate data contamination threats to model evaluation. To fill the gap, we systematically study to quantify the impacts of these countermeasures on CLMs' performance. To facilitate the study, we collected over 2 million Python functions with timestamps ranging from January 1st, 2018, to December 31st, 2023. The data created before the models' cut-off date are considered "contaminated data", while the data where the countermeasures are taken are regarded as "cleansed data". We study the impact of these countermeasures by investigating the difference in CLMs' performance on contaminated and cleansed data derived from different countermeasures. Our experiments yield several interesting observations. For instance, CLMs do not necessarily perform worse on data after the models' cut-off date; on the contrary, they sometimes perform better. In addition, refactoring did not always result in decreased performance; it could lead to improvements instead. Furthermore, existing metrics such as perplexity cannot distinguish contaminated/cleansed data. We hope that the results and observations could help deepen the understanding of CLMs' capabilities and inform the community about data contamination.
☆ Seeking Enlightenment: Incorporating Evidence-Based Practice Techniques in a Research Software Engineering Team
Evidence-based practice (EBP) in software engineering aims to improve decision-making in software development by complementing practitioners' professional judgment with high-quality evidence from research. We believe the use of EBP techniques may be helpful for research software engineers (RSEs) in their work to bring software engineering best practices to scientific software development. In this study, we present an experience report on the use of a particular EBP technique, rapid reviews, within an RSE team at Sandia National Laboratories, and present practical recommendations for how to address barriers to EBP adoption within the RSE community.
comment: 1st Annual Conference of the United States Research Software Engineer Association. 10 pages, 2 figures
☆ Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback
Large language models (LLMs) have shown remarkable progress in automated code generation. Yet, incorporating LLM-based code generation into real-life software projects poses challenges, as the generated code may contain errors in API usage, class, data structure, or missing project-specific information. As much of this project-specific context cannot fit into the prompts of LLMs, we must find ways to allow the model to explore the project-level code context. To this end, this paper puts forward a novel approach, termed ProCoder, which iteratively refines the project-level code context for precise code generation, guided by the compiler feedback. In particular, ProCoder first leverages compiler techniques to identify a mismatch between the generated code and the project's context. It then iteratively aligns and fixes the identified errors using information extracted from the code repository. We integrate ProCoder with two representative LLMs, i.e., GPT-3.5-Turbo and Code Llama (13B), and apply it to Python code generation. Experimental results show that ProCoder significantly improves the vanilla LLMs by over 80% in generating code dependent on project context, and consistently outperforms the existing retrieval-based code generation baselines.
☆ DeepKnowledge: Generalisation-Driven Deep Learning Testing
Despite their unprecedented success, DNNs are notoriously fragile to small shifts in data distribution, demanding effective testing techniques that can assess their dependability. Despite recent advances in DNN testing, there is a lack of systematic testing approaches that assess the DNN's capability to generalise and operate comparably beyond data in their training distribution. We address this gap with DeepKnowledge, a systematic testing methodology for DNN-based systems founded on the theory of knowledge generalisation, which aims to enhance DNN robustness and reduce the residual risk of 'black box' models. Conforming to this theory, DeepKnowledge posits that core computational DNN units, termed Transfer Knowledge neurons, can generalise under domain shift. DeepKnowledge provides an objective confidence measurement on testing activities of DNN given data distribution shifts and uses this information to instrument a generalisation-informed test adequacy criterion to check the transfer knowledge capacity of a test set. Our empirical evaluation of several DNNs, across multiple datasets and state-of-the-art adversarial generation techniques demonstrates the usefulness and effectiveness of DeepKnowledge and its ability to support the engineering of more dependable DNNs. We report improvements of up to 10 percentage points over state-of-the-art coverage criteria for detecting adversarial attacks on several benchmarks, including MNIST, SVHN, and CIFAR.
comment: 10 pages
☆ Enhancing Software Effort Estimation through Reinforcement Learning-based Project Management-Oriented Feature Selection
Purpose: The study aims to investigate the application of the data element market in software project management, focusing on improving effort estimation by addressing challenges faced by traditional methods. Design/methodology/approach: This study proposes a solution based on feature selection, utilizing the data element market and reinforcement learning-based algorithms to enhance the accuracy of software effort estimation. It explores the application of the MARLFS algorithm, customizing improvements to the algorithm and reward function. Findings: This study demonstrates that the proposed approach achieves more precise estimation compared to traditional methods, leveraging feature selection to guide project management in software development. Originality/value: This study contributes to the field by offering a novel approach that combines the data element market, machine learning, and feature selection to improve software effort estimation, addressing limitations of traditional methods and providing insights for future research in project management.
comment: 18pages, 10 figures, 6 tables
♻ ☆ Solving Data-centric Tasks using Large Language Models NAACL 2024
Large language models (LLMs) are rapidly replacing help forums like StackOverflow, and are especially helpful for non-professional programmers and end users. These users are often interested in data-centric tasks, such as spreadsheet manipulation and data wrangling, which are hard to solve if the intent is only communicated using a natural-language description, without including the data. But how do we decide how much data and which data to include in the prompt? This paper makes two contributions towards answering this question. First, we create a dataset of real-world NL-to-code tasks manipulating tabular data, mined from StackOverflow posts. Second, we introduce a cluster-then-select prompting technique, which adds the most representative rows from the input data to the LLM prompt. Our experiments show that LLM performance is indeed sensitive to the amount of data passed in the prompt, and that for tasks with a lot of syntactic variation in the input table, our cluster-then-select technique outperforms a random selection baseline.
comment: Paper accepted to NAACL 2024 (Findings)
♻ ☆ Fix-Con: Automatic Fault Localization and Repair of Deep Learning Model Conversions between Frameworks
Converting deep learning models between frameworks is a common step to maximize model compatibility across devices and leverage optimization features that may be exclusively provided in one deep learning framework. However, this conversion process may be riddled with bugs, making the converted models either undeployable or problematic, considerably degrading their prediction correctness. In this paper we propose an automated approach for fault localization and repair, Fix-Con, during model conversion between deep learning frameworks. Fix-Con is capable of detecting and fixing faults introduced in model input, parameters, hyperparameters, and the model graph during conversion. Fix-Con uses a set of fault types (mined from surveying conversion issues reported \nick{in code repositories and forums}) to localize potential conversion faults in the converted target model and then repair them appropriately, e.g., replacing the parameters of the target model with those from the source model. This is done iteratively for every image in the dataset, comparing output label differences between the source model and the converted target model until all differences are resolved. We evaluate the effectiveness of Fix-Con in fixing model conversion bugs of three widely used image recognition models converted across four different deep learning frameworks. Overall, Fix-Con was able to fix $462$ out of $755$ detected conversion faults, either completely repairing or significantly improving the performance of $14$ out of the $15$ erroneous conversion cases.
comment: 12 pages, 4 figures, 3 tables, 1 algorithm
♻ ☆ Fault Localization for Buggy Deep Learning Framework Conversions in Image Recognition
When deploying Deep Neural Networks (DNNs), developers often convert models from one deep learning framework to another (e.g., TensorFlow to PyTorch). However, this process is error-prone and can impact target model accuracy. To identify the extent of such impact, we perform and briefly present a differential analysis against three DNNs widely used for image recognition (MobileNetV2, ResNet101, and InceptionV3) converted across four well-known deep learning frameworks (PyTorch, Keras, TensorFlow (TF), and TFLite), which revealed numerous model crashes and output label discrepancies of up to 100%. To mitigate such errors, we present a novel approach towards fault localization and repair of buggy deep learning framework conversions, focusing on pre-trained image recognition models. Our technique consists of four stages of analysis: 1) conversion tools, 2) model parameters, 3) model hyperparameters, and 4) graph representation. In addition, we propose various strategies towards fault repair of the faults detected. We implement our technique on top of the Apache TVM deep learning compiler, and we test it by conducting a preliminary fault localization analysis for the conversion of InceptionV3 from TF to TFLite. Our approach detected a fault in a common DNN converter tool, which introduced precision errors in weights, reducing model accuracy. After our fault localization, we repaired the issue, reducing our conversion error to zero.
comment: 5 pages, 3 figures, 1 table
♻ ☆ DeltaNN: Assessing the Impact of Computational Environment Parameters on the Performance of Image Recognition Models
Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and TPUs for fast, timely processing. Failure in real-time image recognition tasks can occur due to sub-optimal mapping on hardware accelerators during model deployment, which may lead to timing uncertainty and erroneous behavior. Mapping on hardware accelerators is done using multiple software components like deep learning frameworks, compilers, and device libraries, that we refer to as the computational environment. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment, as the impact of parameters like deep learning frameworks, compiler optimizations, and hardware devices on model performance and correctness is not yet well understood. In this paper we present a differential testing framework, DeltaNN, that allows us to assess the impact of different computational environment parameters on the performance of image recognition models during deployment, post training. DeltaNN generates different implementations of a given image recognition model for variations in environment parameters, namely, deep learning frameworks, compiler optimizations and hardware devices and analyzes differences in model performance as a result. Using DeltaNN, we conduct an empirical study of robustness analysis of three popular image recognition models using the ImageNet dataset. We report the impact in terms of misclassifications and inference time differences across different settings. In total, we observed up to 100% output label differences across deep learning frameworks, and up to 81% unexpected performance degradation in terms of inference time, when applying compiler optimizations.
comment: 11 pages, 10 figures, 2 tables
♻ ☆ Investigating Robustness in Cyber-Physical Systems: Specification-Centric Analysis in the face of System Deviations
The adoption of cyber-physical systems (CPS) is on the rise in complex physical environments, encompassing domains such as autonomous vehicles, the Internet of Things (IoT), and smart cities. A critical attribute of CPS is robustness, denoting its capacity to operate safely despite potential disruptions and uncertainties in the operating environment. This paper proposes a novel specification-based robustness, which characterizes the effectiveness of a controller in meeting a specified system requirement, articulated through Signal Temporal Logic (STL) while accounting for possible deviations in the system. This paper also proposes the robustness falsification problem based on the definition, which involves identifying minor deviations capable of violating the specified requirement. We present an innovative two-layer simulation-based analysis framework designed to identify subtle robustness violations. To assess our methodology, we devise a series of benchmark problems wherein system parameters can be adjusted to emulate various forms of uncertainties and disturbances. Initial evaluations indicate that our falsification approach proficiently identifies robustness violations, providing valuable insights for comparing robustness between conventional and reinforcement learning (RL)-based controllers
comment: 12 pages
♻ ☆ Unveiling the Blind Spots: A Critical Examination of Fairness in Autonomous Driving Systems
Autonomous driving systems have extended the spectrum of Web of Things for intelligent vehicles and have become an important component of the Web ecosystem. Similar to traditional Web-based applications, fairness is an essential aspect for ensuring the high quality of autonomous driving systems, particularly in the context of pedestrian detectors within them. However, there is an absence in the literature of a comprehensive assessment of the fairness of current Deep Learning (DL)-based pedestrian detectors. To fill the gap, we evaluate eight widely-explored DL-based pedestrian detectors across demographic groups on large-scale real-world datasets. To enable a thorough fairness evaluation, we provide extensive annotations for the datasets, resulting in 8,311 images with 16,070 gender labels, 20,115 age labels, and 3,513 skin tone labels. Our findings reveal significant fairness issues related to age. The undetected proportions for adults are 20.14% lower compared to children. Furthermore, we explore how various driving scenarios affect the fairness of pedestrian detectors. We find that the bias may exacerbate for children and females towards low brightness and low contrast.
comment: Update the models evaluated and the experimental results
Programming Languages
☆ Synapse: Learning Preferential Concepts from Visual Demonstrations
This paper addresses the problem of preference learning, which aims to learn user-specific preferences (e.g., "good parking spot", "convenient drop-off location") from visual input. Despite its similarity to learning factual concepts (e.g., "red cube"), preference learning is a fundamentally harder problem due to its subjective nature and the paucity of person-specific training data. We address this problem using a new framework called Synapse, which is a neuro-symbolic approach designed to efficiently learn preferential concepts from limited demonstrations. Synapse represents preferences as neuro-symbolic programs in a domain-specific language (DSL) that operates over images, and leverages a novel combination of visual parsing, large language models, and program synthesis to learn programs representing individual preferences. We evaluate Synapse through extensive experimentation including a user case study focusing on mobility-related concepts in mobile robotics and autonomous driving. Our evaluation demonstrates that Synapse significantly outperforms existing baselines as well as its own ablations. The code and other details can be found on the project website https://amrl.cs.utexas.edu/synapse .
comment: 23 pages, 7 figures; Preprint
☆ ChatDBG: An AI-Powered Debugging Assistant
This paper presents ChatDBG, the first AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like "why is x null?". To handle these queries, ChatDBG grants the LLM autonomy to take the wheel and drive debugging by issuing commands to navigate through stacks and inspect program state; it then reports its findings and yields back control to the programmer. Our ChatDBG prototype integrates with standard debuggers including LLDB, GDB, and WinDBG for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded nearly 30,000 times.
comment: 11 pages
♻ ☆ Solving Data-centric Tasks using Large Language Models NAACL 2024
Large language models (LLMs) are rapidly replacing help forums like StackOverflow, and are especially helpful for non-professional programmers and end users. These users are often interested in data-centric tasks, such as spreadsheet manipulation and data wrangling, which are hard to solve if the intent is only communicated using a natural-language description, without including the data. But how do we decide how much data and which data to include in the prompt? This paper makes two contributions towards answering this question. First, we create a dataset of real-world NL-to-code tasks manipulating tabular data, mined from StackOverflow posts. Second, we introduce a cluster-then-select prompting technique, which adds the most representative rows from the input data to the LLM prompt. Our experiments show that LLM performance is indeed sensitive to the amount of data passed in the prompt, and that for tasks with a lot of syntactic variation in the input table, our cluster-then-select technique outperforms a random selection baseline.
comment: Paper accepted to NAACL 2024 (Findings)
♻ ☆ Typed compositional quantum computation with lenses
We propose a type-theoretic framework for describing and proving properties of quantum computations, in particular those presented as quantum circuits. Our proposal is based on an observation that, in the polymorphic type system of Coq, currying on quantum states allows us to apply quantum gates directly inside a complex circuit. By introducing a discrete notion of lens to control this currying, we are further able to separate the combinatorics of the circuit structure from the computational content of gates. We apply our development to define quantum circuits recursively from the bottom up, and prove their correctness compositionally.
♻ ☆ Free Doubly-Infinitary Distributive Categories are Cartesian Closed
We delve into the concept of categories with products that distribute over coproducts, which we call doubly-infinitary distributive categories. We show various instances of doubly-infinitary distributive categories aiming for a comparative analysis with established notions such as extensivity, infinitary distributiveness, and cartesian closedness. Our exploration reveals that this condition represents a substantial extension beyond the classical understanding of infinitary distributive categories. Our main theorem establishes that free doubly-infinitary distributive categories are cartesian closed. We end the paper with remarks on non-canonical isomorphisms, open questions, and future work.
comment: 15 pages, minor, typos, Cantor space
Formal Languages and Automata Theory
☆ Presenting Interval Pomsets with Interfaces
Interval-order partially ordered multisets with interfaces (ipomsets) have shown to be a versatile model for executions of concurrent systems in which both precedence and concurrency need to be taken into account. In this paper, we develop a presentation of ipomsets as generated by a graph of certain discrete ipomsets (starters and terminators) under the relation which composes subsequent starters and subsequent terminators. Using this presentation, we show that also subsumptions are generated by elementary relations. We develop a similar correspondence on the automata side, relating higher-dimensional automata, which generate ipomsets, and ST-automata, which generate step sequences, and their respective languages.
comment: Submitted to RAMICS 2024, 16 pages + reference
Symbolic Computation
☆ Two Algorithms for Computing Rational Univariate Representations of Zero-Dimensional Ideals with Parameters
Two algorithms for computing the rational univariate representation of zero-dimensional ideals with parameters are presented in the paper. Different from the rational univariate representation of zero-dimensional ideals without parameters, the number of zeros of zero-dimensional ideals with parameters under various specializations is different, which leads to choosing and checking the separating element, the key to computing the rational univariate representation, is difficult. In order to pick out the separating element, by partitioning the parameter space we can ensure that under each branch the ideal has the same number of zeros. Subsequently with the help of the extended subresultant theorem for parametric cases, two ideas are given to conduct the further partition of parameter space for choosing and checking the separating element. Based on these, we give two algorithms for computing rational univariate representations of zero-dimensional ideals with parameters. Furthermore, the two algorithms have been implemented on the computer algebra system Singular. Experimental data show that the second algorithm has the better performance in contrast to the first one.
Computer Vision and Pattern Recognition
☆ Creating a Digital Twin of Spinal Surgery: A Proof of Concept
Surgery digitalization is the process of creating a virtual replica of real-world surgery, also referred to as a surgical digital twin (SDT). It has significant applications in various fields such as education and training, surgical planning, and automation of surgical tasks. Given their detailed representations of surgical procedures, SDTs are an ideal foundation for machine learning methods, enabling automatic generation of training data. In robotic surgery, SDTs can provide realistic virtual environments in which robots may learn through trial and error. In this paper, we present a proof of concept (PoC) for surgery digitalization that is applied to an ex-vivo spinal surgery performed in realistic conditions. The proposed digitalization focuses on the acquisition and modelling of the geometry and appearance of the entire surgical scene. We employ five RGB-D cameras for dynamic 3D reconstruction of the surgeon, a high-end camera for 3D reconstruction of the anatomy, an infrared stereo camera for surgical instrument tracking, and a laser scanner for 3D reconstruction of the operating room and data fusion. We justify the proposed methodology, discuss the challenges faced and further extensions of our prototype. While our PoC partially relies on manual data curation, its high quality and great potential motivate the development of automated methods for the creation of SDTs. The quality of our SDT can be assessed in a rendered video available at https://youtu.be/LqVaWGgaTMY .
☆ DPStyler: Dynamic PromptStyler for Source-Free Domain Generalization
Source-Free Domain Generalization (SFDG) aims to develop a model that works for unseen target domains without relying on any source domain. Recent work, PromptStyler, employs text prompts to simulate different distribution shifts in the joint vision-language space, allowing the model to generalize effectively to unseen domains without using any images. However, 1) PromptStyler's style generation strategy has limitations, as all style patterns are fixed after the first training phase. This leads to the training set in the second training phase being restricted to a limited set of styles. Additionally, 2) the frozen text encoder in PromptStyler result in the encoder's output varying with the style of the input text prompts, making it difficult for the model to learn domain-invariant features. In this paper, we introduce Dynamic PromptStyler (DPStyler), comprising Style Generation and Style Removal modules to address these issues. The Style Generation module refreshes all styles at every training epoch, while the Style Removal module eliminates variations in the encoder's output features caused by input styles. Moreover, since the Style Generation module, responsible for generating style word vectors using random sampling or style mixing, makes the model sensitive to input text prompts, we introduce a model ensemble method to mitigate this sensitivity. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on benchmark datasets.
☆ Assessing the Performance of Deep Learning for Automated Gleason Grading in Prostate Cancer
Prostate cancer is a dominant health concern calling for advanced diagnostic tools. Utilizing digital pathology and artificial intelligence, this study explores the potential of 11 deep neural network architectures for automated Gleason grading in prostate carcinoma focusing on comparing traditional and recent architectures. A standardized image classification pipeline, based on the AUCMEDI framework, facilitated robust evaluation using an in-house dataset consisting of 34,264 annotated tissue tiles. The results indicated varying sensitivity across architectures, with ConvNeXt demonstrating the strongest performance. Notably, newer architectures achieved superior performance, even though with challenges in differentiating closely related Gleason grades. The ConvNeXt model was capable of learning a balance between complexity and generalizability. Overall, this study lays the groundwork for enhanced Gleason grading systems, potentially improving diagnostic efficiency for prostate cancer.
☆ Synapse: Learning Preferential Concepts from Visual Demonstrations
This paper addresses the problem of preference learning, which aims to learn user-specific preferences (e.g., "good parking spot", "convenient drop-off location") from visual input. Despite its similarity to learning factual concepts (e.g., "red cube"), preference learning is a fundamentally harder problem due to its subjective nature and the paucity of person-specific training data. We address this problem using a new framework called Synapse, which is a neuro-symbolic approach designed to efficiently learn preferential concepts from limited demonstrations. Synapse represents preferences as neuro-symbolic programs in a domain-specific language (DSL) that operates over images, and leverages a novel combination of visual parsing, large language models, and program synthesis to learn programs representing individual preferences. We evaluate Synapse through extensive experimentation including a user case study focusing on mobility-related concepts in mobile robotics and autonomous driving. Our evaluation demonstrates that Synapse significantly outperforms existing baselines as well as its own ablations. The code and other details can be found on the project website https://amrl.cs.utexas.edu/synapse .
comment: 23 pages, 7 figures; Preprint
☆ DeepGleason: a System for Automated Gleason Grading of Prostate Cancer using Deep Neural Networks
Advances in digital pathology and artificial intelligence (AI) offer promising opportunities for clinical decision support and enhancing diagnostic workflows. Previous studies already demonstrated AI's potential for automated Gleason grading, but lack state-of-the-art methodology and model reusability. To address this issue, we propose DeepGleason: an open-source deep neural network based image classification system for automated Gleason grading using whole-slide histopathology images from prostate tissue sections. Implemented with the standardized AUCMEDI framework, our tool employs a tile-wise classification approach utilizing fine-tuned image preprocessing techniques in combination with a ConvNeXt architecture which was compared to various state-of-the-art architectures. The neural network model was trained and validated on an in-house dataset of 34,264 annotated tiles from 369 prostate carcinoma slides. We demonstrated that DeepGleason is capable of highly accurate and reliable Gleason grading with a macro-averaged F1-score of 0.806, AUC of 0.991, and Accuracy of 0.974. The internal architecture comparison revealed that the ConvNeXt model was superior performance-wise on our dataset to established and other modern architectures like transformers. Furthermore, we were able to outperform the current state-of-the-art in tile-wise fine-classification with a sensitivity and specificity of 0.94 and 0.98 for benign vs malignant detection as well as of 0.91 and 0.75 for Gleason 3 vs Gleason 4 & 5 classification, respectively. Our tool contributes to the wider adoption of AI-based Gleason grading within the research community and paves the way for broader clinical application of deep learning models in digital pathology. DeepGleason is open-source and publicly available for research application in the following Git repository: https://github.com/frankkramer-lab/DeepGleason.
☆ FOOL: Addressing the Downlink Bottleneck in Satellite Computing with Neural Feature Compression
Nanosatellite constellations equipped with sensors capturing large geographic regions provide unprecedented opportunities for Earth observation. As constellation sizes increase, network contention poses a downlink bottleneck. Orbital Edge Computing (OEC) leverages limited onboard compute resources to reduce transfer costs by processing the raw captures at the source. However, current solutions have limited practicability due to reliance on crude filtering methods or over-prioritizing particular downstream tasks. This work presents FOOL, an OEC-native and task-agnostic feature compression method that preserves prediction performance. FOOL partitions high-resolution satellite imagery to maximize throughput. Further, it embeds context and leverages inter-tile dependencies to lower transfer costs with negligible overhead. While FOOL is a feature compressor, it can recover images with competitive scores on perceptual quality measures at lower bitrates. We extensively evaluate transfer cost reduction by including the peculiarity of intermittently available network connections in low earth orbit. Lastly, we test the feasibility of our system for standardized nanosatellite form factors. We demonstrate that FOOL permits downlinking over 100x the data volume without relying on prior information on the downstream tasks.
comment: 18 pages, double column, 19 figures, 7 tables, Initial Submission to IEEE Transactions on Mobile Computing
☆ Domain Adaptive Detection of MAVs: A Benchmark and Noise Suppression Network
Visual detection of Micro Air Vehicles (MAVs) has attracted increasing attention in recent years due to its important application in various tasks. The existing methods for MAV detection assume that the training set and testing set have the same distribution. As a result, when deployed in new domains, the detectors would have a significant performance degradation due to domain discrepancy. In this paper, we study the problem of cross-domain MAV detection. The contributions of this paper are threefold. 1) We propose a Multi-MAV-Multi-Domain (M3D) dataset consisting of both simulation and realistic images. Compared to other existing datasets, the proposed one is more comprehensive in the sense that it covers rich scenes, diverse MAV types, and various viewing angles. A new benchmark for cross-domain MAV detection is proposed based on the proposed dataset. 2) We propose a Noise Suppression Network (NSN) based on the framework of pseudo-labeling and a large-to-small training procedure. To reduce the challenging pseudo-label noises, two novel modules are designed in this network. The first is a prior-based curriculum learning module for allocating adaptive thresholds for pseudo labels with different difficulties. The second is a masked copy-paste augmentation module for pasting truly-labeled MAVs on unlabeled target images and thus decreasing pseudo-label noises. 3) Extensive experimental results verify the superior performance of the proposed method compared to the state-of-the-art ones. In particular, it achieves mAP of 46.9%(+5.8%), 50.5%(+3.7%), and 61.5%(+11.3%) on the tasks of simulation-to-real adaptation, cross-scene adaptation, and cross-camera adaptation, respectively.
comment: 17 pages, 11 figures. Accepted by IEEE Transactions on Automation Science and Engineering
☆ Clustering Propagation for Universal Medical Image Segmentation CVPR2024
Prominent solutions for medical image segmentation are typically tailored for automatic or interactive setups, posing challenges in facilitating progress achieved in one task to another.$_{\!}$ This$_{\!}$ also$_{\!}$ necessitates$_{\!}$ separate$_{\!}$ models for each task, duplicating both training time and parameters.$_{\!}$ To$_{\!}$ address$_{\!}$ above$_{\!}$ issues,$_{\!}$ we$_{\!}$ introduce$_{\!}$ S2VNet,$_{\!}$ a$_{\!}$ universal$_{\!}$ framework$_{\!}$ that$_{\!}$ leverages$_{\!}$ Slice-to-Volume$_{\!}$ propagation$_{\!}$ to$_{\!}$ unify automatic/interactive segmentation within a single model and one training session. Inspired by clustering-based segmentation techniques, S2VNet makes full use of the slice-wise structure of volumetric data by initializing cluster centers from the cluster$_{\!}$ results$_{\!}$ of$_{\!}$ previous$_{\!}$ slice.$_{\!}$ This enables knowledge acquired from prior slices to assist in the segmentation of the current slice, further efficiently bridging the communication between remote slices using mere 2D networks. Moreover, such a framework readily accommodates interactive segmentation with no architectural change, simply by initializing centroids from user inputs. S2VNet distinguishes itself by swift inference speeds and reduced memory consumption compared to prevailing 3D solutions. It can also handle multi-class interactions with each of them serving to initialize different centroids. Experiments on three benchmarks demonstrate S2VNet surpasses task-specified solutions on both automatic/interactive setups.
comment: Accepted by CVPR2024
☆ Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution
Artifact-free super-resolution (SR) aims to translate low-resolution images into their high-resolution counterparts with a strict integrity of the original content, eliminating any distortions or synthetic details. While traditional diffusion-based SR techniques have demonstrated remarkable abilities to enhance image detail, they are prone to artifact introduction during iterative procedures. Such artifacts, ranging from trivial noise to unauthentic textures, deviate from the true structure of the source image, thus challenging the integrity of the super-resolution process. In this work, we propose Self-Adaptive Reality-Guided Diffusion (SARGD), a training-free method that delves into the latent space to effectively identify and mitigate the propagation of artifacts. Our SARGD begins by using an artifact detector to identify implausible pixels, creating a binary mask that highlights artifacts. Following this, the Reality Guidance Refinement (RGR) process refines artifacts by integrating this mask with realistic latent representations, improving alignment with the original image. Nonetheless, initial realistic-latent representations from lower-quality images result in over-smoothing in the final output. To address this, we introduce a Self-Adaptive Guidance (SAG) mechanism. It dynamically computes a reality score, enhancing the sharpness of the realistic latent. These alternating mechanisms collectively achieve artifact-free super-resolution. Extensive experiments demonstrate the superiority of our method, delivering detailed artifact-free high-resolution images while reducing sampling steps by 2X. We release our code at https://github.com/ProAirVerse/Self-Adaptive-Guidance-Diffusion.git.
☆ Multi-Scale Texture Loss for CT denoising with GANs
Generative Adversarial Networks (GANs) have proved as a powerful framework for denoising applications in medical imaging. However, GAN-based denoising algorithms still suffer from limitations in capturing complex relationships within the images. In this regard, the loss function plays a crucial role in guiding the image generation process, encompassing how much a synthetic image differs from a real image. To grasp highly complex and non-linear textural relationships in the training process, this work presents a loss function that leverages the intrinsic multi-scale nature of the Gray-Level-Co-occurrence Matrix (GLCM). Although the recent advances in deep learning have demonstrated superior performance in classification and detection tasks, we hypothesize that its information content can be valuable when integrated into GANs' training. To this end, we propose a differentiable implementation of the GLCM suited for gradient-based optimization. Our approach also introduces a self-attention layer that dynamically aggregates the multi-scale texture information extracted from the images. We validate our approach by carrying out extensive experiments in the context of low-dose CT denoising, a challenging application that aims to enhance the quality of noisy CT scans. We utilize three publicly available datasets, including one simulated and two real datasets. The results are promising as compared to other well-established loss functions, being also consistent across three different GAN architectures. The code is available at: https://github.com/FrancescoDiFeola/DenoTextureLoss
☆ AI-Generated Video Detection via Spatio-Temporal Anomaly Learning
The advancement of generation models has led to the emergence of highly realistic artificial intelligence (AI)-generated videos. Malicious users can easily create non-existent videos to spread false information. This letter proposes an effective AI-generated video detection (AIGVDet) scheme by capturing the forensic traces with a two-branch spatio-temporal convolutional neural network (CNN). Specifically, two ResNet sub-detectors are learned separately for identifying the anomalies in spatical and optical flow domains, respectively. Results of such sub-detectors are fused to further enhance the discrimination ability. A large-scale generated video dataset (GVD) is constructed as a benchmark for model training and evaluation. Extensive experimental results verify the high generalization and robustness of our AIGVDet scheme. Code and dataset will be available at https://github.com/multimediaFor/AIGVDet.
☆ V2X-PC: Vehicle-to-everything Collaborative Perception via Point Cluster
The objective of the collaborative vehicle-to-everything perception task is to enhance the individual vehicle's perception capability through message communication among neighboring traffic agents. Previous methods focus on achieving optimal performance within bandwidth limitations and typically adopt BEV maps as the basic collaborative message units. However, we demonstrate that collaboration with dense representations is plagued by object feature destruction during message packing, inefficient message aggregation for long-range collaboration, and implicit structure representation communication. To tackle these issues, we introduce a brand new message unit, namely point cluster, designed to represent the scene sparsely with a combination of low-level structure information and high-level semantic information. The point cluster inherently preserves object information while packing messages, with weak relevance to the collaboration range, and supports explicit structure modeling. Building upon this representation, we propose a novel framework V2X-PC for collaborative perception. This framework includes a Point Cluster Packing (PCP) module to keep object feature and manage bandwidth through the manipulation of cluster point numbers. As for effective message aggregation, we propose a Point Cluster Aggregation (PCA) module to match and merge point clusters associated with the same object. To further handle time latency and pose errors encountered in real-world scenarios, we propose parameter-free solutions that can adapt to different noisy levels without finetuning. Experiments on two widely recognized collaborative perception benchmarks showcase the superior performance of our method compared to the previous state-of-the-art approaches relying on BEV maps.
☆ SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
Recent advancements in diffusion models have positioned them at the forefront of image generation. Despite their superior performance, diffusion models are not without drawbacks; they are characterized by complex architectures and substantial computational demands, resulting in significant latency due to their iterative sampling process. To mitigate these limitations, we introduce a dual approach involving model miniaturization and a reduction in sampling steps, aimed at significantly decreasing model latency. Our methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature matching and score distillation. We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FP (60x faster than SDXL) on a single GPU, respectively. Moreover, our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.
☆ Calibrating Bayesian UNet++ for Sub-Seasonal Forecasting ICLR 2024
Seasonal forecasting is a crucial task when it comes to detecting the extreme heat and colds that occur due to climate change. Confidence in the predictions should be reliable since a small increase in the temperatures in a year has a big impact on the world. Calibration of the neural networks provides a way to ensure our confidence in the predictions. However, calibrating regression models is an under-researched topic, especially in forecasters. We calibrate a UNet++ based architecture, which was shown to outperform physics-based models in temperature anomalies. We show that with a slight trade-off between prediction error and calibration error, it is possible to get more reliable and sharper forecasts. We believe that calibration should be an important part of safety-critical machine learning applications such as weather forecasters.
comment: Accepted as a workshop paper at "ICLR 2024 Tackling Climate Change with Machine Learning"
☆ Enhancing Industrial Transfer Learning with Style Filter: Cost Reduction and Defect-Focus
Addressing the challenge of data scarcity in industrial domains, transfer learning emerges as a pivotal paradigm. This work introduces Style Filter, a tailored methodology for industrial contexts. By selectively filtering source domain data before knowledge transfer, Style Filter reduces the quantity of data while maintaining or even enhancing the performance of transfer learning strategy. Offering label-free operation, minimal reliance on prior knowledge, independence from specific models, and re-utilization, Style Filter is evaluated on authentic industrial datasets, highlighting its effectiveness when employed before conventional transfer strategies in the deep learning domain. The results underscore the effectiveness of Style Filter in real-world industrial applications.
comment: 17 pages, 11 figures,4 tables
☆ SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation CVPR2024
In recent years, semantic segmentation has become a pivotal tool in processing and interpreting satellite imagery. Yet, a prevalent limitation of supervised learning techniques remains the need for extensive manual annotations by experts. In this work, we explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. The main idea is to learn the joint data manifold of images and labels, leveraging recent advancements in denoising diffusion probabilistic models. To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation. We find that the obtained pairs not only display high quality in fine-scale features but also ensure a wide sampling diversity. Both aspects are crucial for earth observation data, where semantic classes can vary severely in scale and occurrence frequency. We employ the novel data instances for downstream segmentation, as a form of data augmentation. In our experiments, we provide comparisons to prior works based on discriminative diffusion models or GANs. We demonstrate that integrating generated samples yields significant quantitative improvements for satellite semantic segmentation -- both compared to baselines and when training only on the original data.
comment: Accepted to CVPR2024
☆ EDUE: Expert Disagreement-Guided One-Pass Uncertainty Estimation for Medical Image Segmentation
Deploying deep learning (DL) models in medical applications relies on predictive performance and other critical factors, such as conveying trustworthy predictive uncertainty. Uncertainty estimation (UE) methods provide potential solutions for evaluating prediction reliability and improving the model confidence calibration. Despite increasing interest in UE, challenges persist, such as the need for explicit methods to capture aleatoric uncertainty and align uncertainty estimates with real-life disagreements among domain experts. This paper proposes an Expert Disagreement-Guided Uncertainty Estimation (EDUE) for medical image segmentation. By leveraging variability in ground-truth annotations from multiple raters, we guide the model during training and incorporate random sampling-based strategies to enhance calibration confidence. Our method achieves 55% and 23% improvement in correlation on average with expert disagreements at the image and pixel levels, respectively, better calibration, and competitive segmentation performance compared to the state-of-the-art deep ensembles, requiring only a single forward pass.
☆ In the Search for Optimal Multi-view Learning Models for Crop Classification with Global Remote Sensing Data
Crop classification is of critical importance due to its role in studying crop pattern changes, resource management, and carbon sequestration. When employing data-driven techniques for its prediction, utilizing various temporal data sources is necessary. Deep learning models have proven to be effective for this task by mapping time series data to high-level representation for prediction. However, they face substantial challenges when dealing with multiple input patterns. The literature offers limited guidance for Multi-View Learning (MVL) scenarios, as it has primarily focused on exploring fusion strategies with specific encoders and validating them in local regions. In contrast, we investigate the impact of simultaneous selection of the fusion strategy and the encoder architecture evaluated on a global-scale cropland and crop-type classifications. We use a range of five fusion strategies (Input, Feature, Decision, Ensemble, Hybrid) and five temporal encoder architectures (LSTM, GRU, TempCNN, TAE, L-TAE) as possible MVL model configurations. The validation is on the CropHarvest dataset that provides optical, radar, and weather time series, and topographic information as input data. We found that in scenarios with a limited number of labeled samples, a unique configuration is insufficient for all the cases. Instead, a specialized combination, including encoder and fusion strategy, should be meticulously sought. To streamline this search process, we suggest initially identifying the optimal encoder architecture tailored for a particular fusion strategy, and then determining the most suitable fusion strategy for the classification task. We provide a technical framework for researchers exploring crop classification or related tasks through a MVL approach.
comment: submitted to journal
☆ SegICL: A Universal In-context Learning Framework for Enhanced Segmentation in Medical Imaging
Medical image segmentation models adapting to new tasks in a training-free manner through in-context learning is an exciting advancement. Universal segmentation models aim to generalize across the diverse modality of medical images, yet their effectiveness often diminishes when applied to out-of-distribution (OOD) data modalities and tasks, requiring intricate fine-tuning of model for optimal performance. For addressing this challenge, we introduce SegICL, a novel approach leveraging In-Context Learning (ICL) for image segmentation. Unlike existing methods, SegICL has the capability to employ text-guided segmentation and conduct in-context learning with a small set of image-mask pairs, eliminating the need for training the model from scratch or fine-tuning for OOD tasks (including OOD modality and dataset). Extensive experimental validation of SegICL demonstrates a positive correlation between the number of prompt samples and segmentation performance on OOD modalities and tasks. This indicates that SegICL effectively address new segmentation tasks based on contextual information. Additionally, SegICL also exhibits comparable segmentation performance to mainstream models on OOD and in-distribution tasks. Our code will be released soon.
☆ Revealing Vulnerabilities of Neural Networks in Parameter Learning and Defense Against Explanation-Aware Backdoors
Explainable Artificial Intelligence (XAI) strategies play a crucial part in increasing the understanding and trustworthiness of neural networks. Nonetheless, these techniques could potentially generate misleading explanations. Blinding attacks can drastically alter a machine learning algorithm's prediction and explanation, providing misleading information by adding visually unnoticeable artifacts into the input, while maintaining the model's accuracy. It poses a serious challenge in ensuring the reliability of XAI methods. To ensure the reliability of XAI methods poses a real challenge, we leverage statistical analysis to highlight the changes in CNN weights within a CNN following blinding attacks. We introduce a method specifically designed to limit the effectiveness of such attacks during the evaluation phase, avoiding the need for extra training. The method we suggest defences against most modern explanation-aware adversarial attacks, achieving an approximate decrease of ~99\% in the Attack Success Rate (ASR) and a ~91\% reduction in the Mean Square Error (MSE) between the original explanation and the defended (post-attack) explanation across three unique types of attacks.
☆ Elysium: Exploring Object-level Perception in Videos via MLLM
Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. Firstly, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Secondly, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset paired with novel tasks: Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we conduct training of MLLMs and propose a token-compression model T-Selector to tackle the second challenge. Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that makes the first attempt to conduct object-level tasks in videos without requiring any additional plug-in or expert models.
☆ QKFormer: Hierarchical Spiking Transformer using Q-K Attention
Spiking Transformers, which integrate Spiking Neural Networks (SNNs) with Transformer architectures, have attracted significant attention due to their potential for energy efficiency and high performance. However, existing models in this domain still suffer from suboptimal performance. We introduce several innovations to improve the performance: i) We propose a novel spike-form Q-K attention mechanism, tailored for SNNs, which efficiently models the importance of token or channel dimensions through binary vectors with linear complexity. ii) We incorporate the hierarchical structure, which significantly benefits the performance of both the brain and artificial neural networks, into spiking transformers to obtain multi-scale spiking representation. iii) We design a versatile and powerful patch embedding module with a deformed shortcut specifically for spiking transformers. Together, we develop QKFormer, a hierarchical spiking transformer based on Q-K attention with direct training. QKFormer shows significantly superior performance over existing state-of-the-art SNN models on various mainstream datasets. Notably, with comparable size to Spikformer (66.34 M, 74.81%), QKFormer (64.96 M) achieves a groundbreaking top-1 accuracy of 85.65% on ImageNet-1k, substantially outperforming Spikformer by 10.84%. To our best knowledge, this is the first time that directly training SNNs have exceeded 85% accuracy on ImageNet-1K. The code and models are publicly available at https://github.com/zhouchenlin2096/QKFormer
comment: 10 pages, code: https://github.com/zhouchenlin2096/QKFormer
☆ DOrA: 3D Visual Grounding with Order-Aware Referring
3D visual grounding aims to identify the target object within a 3D point cloud scene referred to by a natural language description. While previous works attempt to exploit the verbo-visual relation with proposed cross-modal transformers, unstructured natural utterances and scattered objects might lead to undesirable performances. In this paper, we introduce DOrA, a novel 3D visual grounding framework with Order-Aware referring. DOrA is designed to leverage Large Language Models (LLMs) to parse language description, suggesting a referential order of anchor objects. Such ordered anchor objects allow DOrA to update visual features and locate the target object during the grounding process. Experimental results on the NR3D and ScanRefer datasets demonstrate our superiority in both low-resource and full-data scenarios. In particular, DOrA surpasses current state-of-the-art frameworks by 9.3% and 7.8% grounding accuracy under 1% data and 10% data settings, respectively.
☆ VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting
Combining CNNs or ViTs, with RNNs for spatiotemporal forecasting, has yielded unparalleled results in predicting temporal and spatial dynamics. However, modeling extensive global information remains a formidable challenge; CNNs are limited by their narrow receptive fields, and ViTs struggle with the intensive computational demands of their attention mechanisms. The emergence of recent Mamba-based architectures has been met with enthusiasm for their exceptional long-sequence modeling capabilities, surpassing established vision models in efficiency and accuracy, which motivates us to develop an innovative architecture tailored for spatiotemporal forecasting. In this paper, we propose the VMRNN cell, a new recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM. We construct a network centered on VMRNN cells to tackle spatiotemporal prediction tasks effectively. Our extensive evaluations show that our proposed approach secures competitive results on a variety of tasks while maintaining a smaller model size. Our code is available at https://github.com/yyyujintang/VMRNN-PyTorch.
comment: 11 pages, 7 figures. arXiv admin note: text overlap with arXiv:2308.09891 by other authors
☆ An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models
Diffusion models have been widely used for conditional data cross-modal generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align the generated visual concepts with high-level semantics in a language such as object count, spatial relationship, etc. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies can affect vision-language alignment. We discover that compared to the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can: (i) boost text-to-image alignment with improved generation quality and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention calculations. We perform experiments using a text-to-image generation task on the MS-COCO dataset. We compare our intermediate fusion mechanism with the classic early fusion mechanism on two common conditioning methods on a U-shaped ViT backbone. Our intermediate fusion model achieves a higher CLIP Score and lower FID, with 20% reduced FLOPs, and 50% increased training speed compared to a strong U-ViT baseline with an early fusion.
☆ Open-Set Recognition in the Age of Vision-Language Models
Are vision-language models (VLMs) open-set models because they are trained on internet-scale datasets? We answer this question with a clear no - VLMs introduce closed-set assumptions via their finite query set, making them vulnerable to open-set conditions. We systematically evaluate VLMs for open-set recognition and find they frequently misclassify objects not contained in their query set, leading to alarmingly low precision when tuned for high recall and vice versa. We show that naively increasing the size of the query set to contain more and more classes does not mitigate this problem, but instead causes diminishing task performance and open-set performance. We establish a revised definition of the open-set problem for the age of VLMs, define a new benchmark and evaluation protocol to facilitate standardised evaluation and research in this important area, and evaluate promising baseline approaches based on predictive uncertainty and dedicated negative embeddings on a range of VLM classifiers and object detectors.
comment: 31 pages, under review
☆ ModeTv2: GPU-accelerated Motion Decomposition Transformer for Pairwise Optimization in Medical Image Registration
Deformable image registration plays a crucial role in medical imaging, aiding in disease diagnosis and image-guided interventions. Traditional iterative methods are slow, while deep learning (DL) accelerates solutions but faces usability and precision challenges. This study introduces a pyramid network with the enhanced motion decomposition Transformer (ModeTv2) operator, showcasing superior pairwise optimization (PO) akin to traditional methods. We re-implement ModeT operator with CUDA extensions to enhance its computational efficiency. We further propose RegHead module which refines deformation fields, improves the realism of deformation and reduces parameters. By adopting the PO, the proposed network balances accuracy, efficiency, and generalizability. Extensive experiments on two public brain MRI datasets and one abdominal CT dataset demonstrate the network's suitability for PO, providing a DL model with enhanced usability and interpretability. The code is publicly available.
☆ CMViM: Contrastive Masked Vim Autoencoder for 3D Multi-modal Representation Learning for AD classification
Alzheimer's disease (AD) is an incurable neurodegenerative condition leading to cognitive and functional deterioration. Given the lack of a cure, prompt and precise AD diagnosis is vital, a complex process dependent on multiple factors and multi-modal data. While successful efforts have been made to integrate multi-modal representation learning into medical datasets, scant attention has been given to 3D medical images. In this paper, we propose Contrastive Masked Vim Autoencoder (CMViM), the first efficient representation learning method tailored for 3D multi-modal data. Our proposed framework is built on a masked Vim autoencoder to learn a unified multi-modal representation and long-dependencies contained in 3D medical images. We also introduce an intra-modal contrastive learning module to enhance the capability of the multi-modal Vim encoder for modeling the discriminative features in the same modality, and an inter-modal contrastive learning module to alleviate misaligned representation among modalities. Our framework consists of two main steps: 1) incorporate the Vision Mamba (Vim) into the mask autoencoder to reconstruct 3D masked multi-modal data efficiently. 2) align the multi-modal representations with contrastive learning mechanisms from both intra-modal and inter-modal aspects. Our framework is pre-trained and validated ADNI2 dataset and validated on the downstream task for AD classification. The proposed CMViM yields 2.7\% AUC performance improvement compared with other state-of-the-art methods.
comment: 11 pages, 1 figure
☆ Visually Guided Generative Text-Layout Pre-training for Document Intelligence NAACL 2024
Prior study shows that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to gain abilities to perceive and reason both document texts and layouts (e.g., locations of texts and table-cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, facilitating ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts of document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
comment: Accepted to NAACL 2024 main conference. The first version of this paper was submitted to OpenReview (https://openreview.net/forum?id=ARtBIBAmNR) in June 2023
☆ Let Real Images be as a Judger, Spotting Fake Images Synthesized with Generative Models
In the last few years, generative models have shown their powerful capabilities in synthesizing realistic images in both quality and diversity (i.e., facial images, and natural subjects). Unfortunately, the artifact patterns in fake images synthesized by different generative models are inconsistent, leading to the failure of previous research that relied on spotting subtle differences between real and fake. In our preliminary experiments, we find that the artifacts in fake images always change with the development of the generative model, while natural images exhibit stable statistical properties. In this paper, we employ natural traces shared only by real images as an additional predictive target in the detector. Specifically, the natural traces are learned from the wild real images and we introduce extended supervised contrastive learning to bring them closer to real images and further away from fake ones. This motivates the detector to make decisions based on the proximity of images to the natural traces. To conduct a comprehensive experiment, we built a high-quality and diverse dataset that includes generative models comprising 6 GAN and 6 diffusion models, to evaluate the effectiveness in generalizing unknown forgery techniques and robustness in surviving different transformations. Experimental results show that our proposed method gives 96.1% mAP significantly outperforms the baselines. Extensive experiments conducted on the widely recognized platform Midjourney reveal that our proposed method achieves an accuracy exceeding 78.4%, underscoring its practicality for real-world application deployment. The source code and partial self-built dataset are available in supplementary material.
☆ Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework CVPR2024
Despite the remarkable process of talking-head-based avatar-creating solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor, a novel system necessitating only a one-minute video clip of an individual for training, subsequently enabling the automatic generation of anchor-style videos with precise torso and hand movements. Specifically, we finetune a proposed structure-guided diffusion model on input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively binding movements with specific appearances. To produce arbitrary long temporal video, we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost, and a simple yet effective batch-overlapped temporal denoising module is proposed to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion/non-diffusion methods. Project page: \url{https://github.com/ICTMCG/Make-Your-Anchor}.
comment: accepted at CVPR2024
☆ Medical Image Registration and Its Application in Retinal Images: A Review
Medical image registration is vital for disease diagnosis and treatment with its ability to merge diverse information of images, which may be captured under different times, angles, or modalities. Although several surveys have reviewed the development of medical image registration, these surveys have not systematically summarized methodologies of existing medical image registration methods. To this end, we provide a comprehensive review of these methods from traditional and deep learning-based directions, aiming to help audiences understand the development of medical image registration quickly. In particular, we review recent advances in retinal image registration at the end of each section, which has not attracted much attention. Additionally, we also discuss the current challenges of retinal image registration and provide insights and prospects for future research.
Self-Supervised Learning for Medical Image Data with Anatomy-Oriented Imaging Planes
Self-supervised learning has emerged as a powerful tool for pretraining deep networks on unlabeled data, prior to transfer learning of target tasks with limited annotation. The relevance between the pretraining pretext and target tasks is crucial to the success of transfer learning. Various pretext tasks have been proposed to utilize properties of medical image data (e.g., three dimensionality), which are more relevant to medical image analysis than generic ones for natural images. However, previous work rarely paid attention to data with anatomy-oriented imaging planes, e.g., standard cardiac magnetic resonance imaging views. As these imaging planes are defined according to the anatomy of the imaged organ, pretext tasks effectively exploiting this information can pretrain the networks to gain knowledge on the organ of interest. In this work, we propose two complementary pretext tasks for this group of medical image data based on the spatial relationship of the imaging planes. The first is to learn the relative orientation between the imaging planes and implemented as regressing their intersecting lines. The second exploits parallel imaging planes to regress their relative slice locations within a stack. Both pretext tasks are conceptually straightforward and easy to implement, and can be combined in multitask learning for better representation learning. Thorough experiments on two anatomical structures (heart and knee) and representative target tasks (semantic segmentation and classification) demonstrate that the proposed pretext tasks are effective in pretraining deep networks for remarkably boosted performance on the target tasks, and superior to other recent approaches.
comment: Medical Image Analysis
☆ PathoTune: Adapting Visual Foundation Model to Pathological Specialists MICCAI 2024
As natural image understanding moves towards the pretrain-finetune era, research in pathology imaging is concurrently evolving. Despite the predominant focus on pretraining pathological foundation models, how to adapt foundation models to downstream tasks is little explored. For downstream adaptation, we propose the existence of two domain gaps, i.e., the Foundation-Task Gap and the Task-Instance Gap. To mitigate these gaps, we introduce PathoTune, a framework designed to efficiently adapt pathological or even visual foundation models to pathology-specific tasks via multi-modal prompt tuning. The proposed framework leverages Task-specific Visual Prompts and Task-specific Textual Prompts to identify task-relevant features, along with Instance-specific Visual Prompts for encoding single pathological image features. Results across multiple datasets at both patch-level and WSI-level demonstrate its superior performance over single-modality prompt tuning approaches. Significantly, PathoTune facilitates the direct adaptation of natural visual foundation models to pathological tasks, drastically outperforming pathological foundation models with simple linear probing. The code will be available upon acceptance.
comment: Submitted to MICCAI 2024
☆ CT-Bound: Fast Boundary Estimation From Noisy Images Via Hybrid Convolution and Transformer Neural Networks
We present CT-Bound, a fast boundary estimation method for noisy images using a hybrid Convolution and Transformer neural network. The proposed architecture decomposes boundary estimation into two tasks: local detection and global regularization of image boundaries. It first estimates a parametric representation of boundary structures only using the input image within a small receptive field and then refines the boundary structure in the parameter domain without accessing the input image. Because of this, a part of the network can be easily trained using naive, synthetic images and still generalized to real images, and the entire architecture is computationally efficient as the boundary refinement is non-iterative and not in the image domain. Compared with the previous highest accuracy methods, our experiment shows that CT-Bound is 100 times faster, producing comparably accurate, high-quality boundary and color maps. We also demonstrate that CT-Bound can produce boundary and color maps on real captured images without extra fine-tuning and real-time boundary map and color map videos at ten frames per second.
comment: 8 pages, 6 figures
☆ REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices
This work tackles the challenging task of achieving real-time novel view synthesis on various scenes, including highly reflective objects and unbounded outdoor scenes. Existing real-time rendering methods, especially those based on meshes, often have subpar performance in modeling surfaces with rich view-dependent appearances. Our key idea lies in leveraging meshes for rendering acceleration while incorporating a novel approach to parameterize view-dependent information. We decompose the color into diffuse and specular, and model the specular color in the reflected direction based on a neural environment map. Our experiments demonstrate that our method achieves comparable reconstruction quality for highly reflective surfaces compared to state-of-the-art offline methods, while also efficiently enabling real-time rendering on edge devices such as smartphones.
comment: Project Page:https://xdimlab.github.io/REFRAME/
☆ Camera-aware Label Refinement for Unsupervised Person Re-identification
Unsupervised person re-identification aims to retrieve images of a specified person without identity labels. Many recent unsupervised Re-ID approaches adopt clustering-based methods to measure cross-camera feature similarity to roughly divide images into clusters. They ignore the feature distribution discrepancy induced by camera domain gap, resulting in the unavoidable performance degradation. Camera information is usually available, and the feature distribution in the single camera usually focuses more on the appearance of the individual and has less intra-identity variance. Inspired by the observation, we introduce a \textbf{C}amera-\textbf{A}ware \textbf{L}abel \textbf{R}efinement~(CALR) framework that reduces camera discrepancy by clustering intra-camera similarity. Specifically, we employ intra-camera training to obtain reliable local pseudo labels within each camera, and then refine global labels generated by inter-camera clustering and train the discriminative model using more reliable global pseudo labels in a self-paced manner. Meanwhile, we develop a camera-alignment module to align feature distributions under different cameras, which could help deal with the camera variance further. Extensive experiments validate the superiority of our proposed method over state-of-the-art approaches. The code is accessible at https://github.com/leeBooMla/CALR.
comment: submitted to IEEE TMM
☆ If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize important textual features for VLMs. EX2 uses reinforcement learning to align a large language model with VLM preferences and generates descriptions that incorporate the important features for the VLM. Then, we inspect the descriptions to identify the features that contribute to VLM representations. We find that spurious descriptions have a major role in VLM representations despite providing no helpful information, e.g., Click to enlarge photo of CONCEPT. More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
comment: Code: https://github.com/BatsResearch/ex2
☆ RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection CVPR2024
Three-dimensional object detection is one of the key tasks in autonomous driving. To reduce costs in practice, low-cost multi-view cameras for 3D object detection are proposed to replace the expansive LiDAR sensors. However, relying solely on cameras is difficult to achieve highly accurate and robust 3D object detection. An effective solution to this issue is combining multi-view cameras with the economical millimeter-wave radar sensor to achieve more reliable multi-modal 3D object detection. In this paper, we introduce RCBEVDet, a radar-camera fusion 3D object detection method in the bird's eye view (BEV). Specifically, we first design RadarBEVNet for radar BEV feature extraction. RadarBEVNet consists of a dual-stream radar backbone and a Radar Cross-Section (RCS) aware BEV encoder. In the dual-stream radar backbone, a point-based encoder and a transformer-based encoder are proposed to extract radar features, with an injection and extraction module to facilitate communication between the two encoders. The RCS-aware BEV encoder takes RCS as the object size prior to scattering the point feature in BEV. Besides, we present the Cross-Attention Multi-layer Fusion module to automatically align the multi-modal BEV feature from radar and camera with the deformable attention mechanism, and then fuse the feature with channel and spatial fusion layers. Experimental results show that RCBEVDet achieves new state-of-the-art radar-camera fusion results on nuScenes and view-of-delft (VoD) 3D object detection benchmarks. Furthermore, RCBEVDet achieves better 3D detection results than all real-time camera-only and radar-camera 3D object detectors with a faster inference speed at 21~28 FPS. The source code will be released at https://github.com/VDIGPKU/RCBEVDet.
comment: Accepted by CVPR2024
☆ Producing and Leveraging Online Map Uncertainty in Trajectory Prediction CVPR 2024
High-definition (HD) maps have played an integral role in the development of modern autonomous vehicle (AV) stacks, albeit with high associated labeling and maintenance costs. As a result, many recent works have proposed methods for estimating HD maps online from sensor data, enabling AVs to operate outside of previously-mapped regions. However, current online map estimation approaches are developed in isolation of their downstream tasks, complicating their integration in AV stacks. In particular, they do not produce uncertainty or confidence estimates. In this work, we extend multiple state-of-the-art online map estimation methods to additionally estimate uncertainty and show how this enables more tightly integrating online mapping with trajectory forecasting. In doing so, we find that incorporating uncertainty yields up to 50% faster training convergence and up to 15% better prediction performance on the real-world nuScenes driving dataset.
comment: 14 pages, 14 figures, 6 tables. CVPR 2024
☆ Real-time Neuron Segmentation for Voltage Imaging
In voltage imaging, where the membrane potentials of individual neurons are recorded at from hundreds to thousand frames per second using fluorescence microscopy, data processing presents a challenge. Even a fraction of a minute of recording with a limited image size yields gigabytes of video data consisting of tens of thousands of frames, which can be time-consuming to process. Moreover, millisecond-level short exposures lead to noisy video frames, obscuring neuron footprints especially in deep-brain samples where noisy signals are buried in background fluorescence. To address this challenge, we propose a fast neuron segmentation method able to detect multiple, potentially overlapping, spiking neurons from noisy video frames, and implement a data processing pipeline incorporating the proposed segmentation method along with GPU-accelerated motion correction. By testing on existing datasets as well as on new datasets we introduce, we show that our pipeline extracts neuron footprints that agree well with human annotation even from cluttered datasets, and demonstrate real-time processing of voltage imaging data on a single desktop computer for the first time.
☆ DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding
Point scene understanding is a challenging task to process real-world scene point cloud, which aims at segmenting each object, estimating its pose, and reconstructing its mesh simultaneously. Recent state-of-the-art method first segments each object and then processes them independently with multiple stages for the different sub-tasks. This leads to a complex pipeline to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner. Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries involving their relationship. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to semantic information and geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to well use the supervisions from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR.
☆ Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the head movement. To this end, we designed the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks. Our analysis demonstrates the effectiveness of addressing distortion specific to egocentric cameras, adopting high-capacity transformers to learn complex hand-object interactions, and fusing predictions from different views. Our study further reveals challenging scenarios intractable with state-of-the-art methods, such as fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects. Our efforts will enrich the community's knowledge foundation and facilitate future hand studies on egocentric hand-object interactions.
☆ Enhancing Visual Place Recognition via Fast and Slow Adaptive Biasing in Event Cameras
Event cameras are increasingly popular in robotics due to their beneficial features, such as low latency, energy efficiency, and high dynamic range. Nevertheless, their downstream task performance is greatly influenced by the optimization of bias parameters. These parameters, for instance, regulate the necessary change in light intensity to trigger an event, which in turn depends on factors such as the environment lighting and camera motion. This paper introduces feedback control algorithms that automatically tune the bias parameters through two interacting methods: 1) An immediate, on-the-fly fast adaptation of the refractory period, which sets the minimum interval between consecutive events, and 2) if the event rate exceeds the specified bounds even after changing the refractory period repeatedly, the controller adapts the pixel bandwidth and event thresholds, which stabilizes after a short period of noise events across all pixels (slow adaptation). Our evaluation focuses on the visual place recognition task, where incoming query images are compared to a given reference database. We conducted comprehensive evaluations of our algorithms' adaptive feedback control in real-time. To do so, we collected the QCR-Fast-and-Slow dataset that contains DAVIS346 event camera streams from 366 repeated traversals of a Scout Mini robot navigating through a 100 meter long indoor lab setting (totaling over 35km distance traveled) in varying brightness conditions with ground truth location information. Our proposed feedback controllers result in superior performance when compared to the standard bias settings and prior feedback control methods. Our findings also detail the impact of bias adjustments on task performance and feature ablation studies on the fast and slow adaptation mechanisms.
comment: 8 pages, 9 figures, paper under review
☆ Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation
Over the past few years, Text-to-Image (T2I) generation approaches based on diffusion models have gained significant attention. However, vanilla diffusion models often suffer from spelling inaccuracies in the text displayed within the generated images. The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications. To produce accurate visual text images, state-of-the-art techniques adopt a glyph-controlled image generation approach, consisting of a text layout generator followed by an image generator that is conditioned on the generated text layout. Nevertheless, our study reveals that these models still face three primary challenges, prompting us to develop a testbed to facilitate future research. We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text. Subsequently, we introduce a training-free framework to enhance the two-stage generation approaches. We examine the effectiveness of our approach on both LenCom-Eval and MARIO-Eval benchmarks and demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores. For instance, our proposed framework improves the backbone model, TextDiffuser, by more than 23\% and 13.5\% in terms of OCR word F1 on LenCom-Eval and MARIO-Eval, respectively. Our work makes a unique contribution to the field by focusing on generating images with long and rare text sequences, a niche previously unexplored by existing literature
☆ Unsupervised Template-assisted Point Cloud Shape Correspondence Network CVPR2024
Unsupervised point cloud shape correspondence aims to establish point-wise correspondences between source and target point clouds. Existing methods obtain correspondences directly by computing point-wise feature similarity between point clouds. However, non-rigid objects possess strong deformability and unusual shapes, making it a longstanding challenge to directly establish correspondences between point clouds with unconventional shapes. To address this challenge, we propose an unsupervised Template-Assisted point cloud shape correspondence Network, termed TANet, including a template generation module and a template assistance module. The proposed TANet enjoys several merits. Firstly, the template generation module establishes a set of learnable templates with explicit structures. Secondly, we introduce a template assistance module that extensively leverages the generated templates to establish more accurate shape correspondences from multiple perspectives. Extensive experiments on four human and animal datasets demonstrate that TANet achieves favorable performance against state-of-the-art methods.
comment: Accepted to CVPR2024
☆ Spike-NeRF: Neural Radiance Field Based On Spike Camera ICME2024
As a neuromorphic sensor with high temporal resolution, spike cameras offer notable advantages over traditional cameras in high-speed vision applications such as high-speed optical estimation, depth estimation, and object tracking. Inspired by the success of the spike camera, we proposed Spike-NeRF, the first Neural Radiance Field derived from spike data, to achieve 3D reconstruction and novel viewpoint synthesis of high-speed scenes. Instead of the multi-view images at the same time of NeRF, the inputs of Spike-NeRF are continuous spike streams captured by a moving spike camera in a very short time. To reconstruct a correct and stable 3D scene from high-frequency but unstable spike data, we devised spike masks along with a distinctive loss function. We evaluate our method qualitatively and numerically on several challenging synthetic scenes generated by blender with the spike camera simulator. Our results demonstrate that Spike-NeRF produces more visually appealing results than the existing methods and the baseline we proposed in high-speed scenes. Our code and data will be released soon.
comment: This paper is accepted by ICME2024
☆ A Survey on Long Video Generation: Challenges, Methods, and Prospects
Video generation is a rapidly advancing research area, garnering significant attention due to its broad range of applications. One critical aspect of this field is the generation of long-duration videos, which presents unique challenges and opportunities. This paper presents the first survey of recent advancements in long video generation and summarises them into two key paradigms: divide and conquer temporal autoregressive. We delve into the common models employed in each paradigm, including aspects of network design and conditioning techniques. Furthermore, we offer a comprehensive overview and classification of the datasets and evaluation metrics which are crucial for advancing long video generation research. Concluding with a summary of existing studies, we also discuss the emerging challenges and future directions in this dynamic field. We hope that this survey will serve as an essential reference for researchers and practitioners in the realm of long video generation.
☆ Ensemble Adversarial Defense via Integration of Multiple Dispersed Low Curvature Models IJCNN
The integration of an ensemble of deep learning models has been extensively explored to enhance defense against adversarial attacks. The diversity among sub-models increases the attack cost required to deceive the majority of the ensemble, thereby improving the adversarial robustness. While existing approaches mainly center on increasing diversity in feature representations or dispersion of first-order gradients with respect to input, the limited correlation between these diversity metrics and adversarial robustness constrains the performance of ensemble adversarial defense. In this work, we aim to enhance ensemble diversity by reducing attack transferability. We identify second-order gradients, which depict the loss curvature, as a key factor in adversarial robustness. Computing the Hessian matrix involved in second-order gradients is computationally expensive. To address this, we approximate the Hessian-vector product using differential approximation. Given that low curvature provides better robustness, our ensemble model was designed to consider the influence of curvature among different sub-models. We introduce a novel regularizer to train multiple more-diverse low-curvature network models. Extensive experiments across various datasets demonstrate that our ensemble model exhibits superior robustness against a range of attacks, underscoring the effectiveness of our approach.
comment: Accepted to The 2024 International Joint Conference on Neural Networks (IJCNN)
☆ ASDF: Assembly State Detection Utilizing Late Fusion by Integrating 6D Pose Estimation
In medical and industrial domains, providing guidance for assembly processes is critical to ensure efficiency and safety. Errors in assembly can lead to significant consequences such as extended surgery times, and prolonged manufacturing or maintenance times in industry. Assembly scenarios can benefit from in-situ AR visualization to provide guidance, reduce assembly times and minimize errors. To enable in-situ visualization 6D pose estimation can be leveraged. Existing 6D pose estimation techniques primarily focus on individual objects and static captures. However, assembly scenarios have various dynamics including occlusion during assembly and dynamics in the assembly objects appearance. Existing work, combining object detection/6D pose estimation and assembly state detection focuses either on pure deep learning-based approaches, or limit the assembly state detection to building blocks. To address the challenges of 6D pose estimation in combination with assembly state detection, our approach ASDF builds upon the strengths of YOLOv8, a real-time capable object detection framework. We extend this framework, refine the object pose and fuse pose knowledge with network-detected pose information. Utilizing our late fusion in our Pose2State module results in refined 6D pose estimation and assembly state detection. By combining both pose and state information, our Pose2State module predicts the final assembly state with precision. Our evaluation on our ASDF dataset shows that our Pose2State module leads to an improved assembly state detection and that the improvement of the assembly state further leads to a more robust 6D pose estimation. Moreover, on the GBOT dataset, we outperform the pure deep learning-based network, and even outperform the hybrid and pure tracking-based approaches.
☆ Multi-attention Associate Prediction Network for Visual Tracking
Classification-regression prediction networks have realized impressive success in several modern deep trackers. However, there is an inherent difference between classification and regression tasks, so they have diverse even opposite demands for feature matching. Existed models always ignore the key issue and only employ a unified matching block in two task branches, decaying the decision quality. Besides, these models also struggle with decision misalignment situation. In this paper, we propose a multi-attention associate prediction network (MAPNet) to tackle the above problems. Concretely, two novel matchers, i.e., category-aware matcher and spatial-aware matcher, are first designed for feature comparison by integrating self, cross, channel or spatial attentions organically. They are capable of fully capturing the category-related semantics for classification and the local spatial contexts for regression, respectively. Then, we present a dual alignment module to enhance the correspondences between two branches, which is useful to find the optimal tracking solution. Finally, we describe a Siamese tracker built upon the proposed prediction network, which achieves the leading performance on five tracking benchmarks, consisting of LaSOT, TrackingNet, GOT-10k, TNL2k and UAV123, and surpasses other state-of-the-art approaches.
☆ Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion CVPR 2024
Image fusion aims to combine information from different source images to create a comprehensively representative image. Existing fusion methods are typically helpless in dealing with degradations in low-quality source images and non-interactive to multiple subjective and objective needs. To solve them, we introduce a novel approach that leverages semantic text guidance image fusion model for degradation-aware and interactive image fusion task, termed as Text-IF. It innovatively extends the classical image fusion to the text guided image fusion along with the ability to harmoniously address the degradation and interaction issues during fusion. Through the text semantic encoder and semantic interaction fusion decoder, Text-IF is accessible to the all-in-one infrared and visible image degradation-aware processing and the interactive flexible fusion outcomes. In this way, Text-IF achieves not only multi-modal image fusion, but also multi-modal information fusion. Extensive experiments prove that our proposed text guided image fusion strategy has obvious advantages over SOTA methods in the image fusion performance and degradation treatment. The code is available at https://github.com/XunpengYi/Text-IF.
comment: Accepted by CVPR 2024
☆ Dia-LLaMA: Towards Large Language Model-driven CT Report Generation
Medical report generation has achieved remarkable advancements yet has still been faced with several challenges. First, the inherent imbalance in the distribution of normal and abnormal cases may lead models to exhibit a biased focus on normal samples, resulting in unreliable diagnoses. Second, the frequent occurrence of common template sentences in the reports may overwhelm the critical abnormal information. Moreover, existing works focus on 2D chest X-rays, leaving CT report generation underexplored due to the high-dimensional nature of CT images and the limited availability of CT-report pairs. Recently, LLM has shown a great ability to generate reliable answers with appropriate prompts, which shed light on addressing the aforementioned challenges. In this paper, we propose Dia-LLaMA, a framework to adapt the LLaMA2-7B for CT report generation by incorporating diagnostic information as guidance prompts. Considering the high dimension of CT, we leverage a pre-trained ViT3D with perceiver to extract the visual information. To tailor the LLM for report generation and emphasize abnormality, we extract additional diagnostic information by referring to a disease prototype memory bank, which is updated during training to capture common disease representations. Furthermore, we introduce disease-aware attention to enable the model to adjust attention for different diseases. Experiments on the chest CT dataset demonstrated that our proposed method outperformed previous methods and achieved state-of-the-art on both clinical efficacy performance and natural language generation metrics. The code will be made publically available.
comment: 10 pages
☆ Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA CVPR 2024
Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions, current chart visual question answering (chart VQA) models suffer on complex reasoning questions. In this work, we address the lack of reasoning ability by data augmentation. We leverage Large Language Models (LLMs), which have shown to have strong reasoning ability, as an automatic data annotator that generates question-answer annotations for chart images. The key innovation in our method lies in the Synthesize Step-by-Step strategy: our LLM-based data generator learns to decompose the complex question into step-by-step sub-questions (rationales), which are then used to derive the final answer using external tools, i.e. Python. This step-wise generation procedure is trained on synthetic data generated using a template-based QA generation pipeline. Experimental results highlight the significance of the proposed step-by-step generation. By training with the LLM-augmented data (LAMENDA), we significantly enhance the chart VQA models, achieving the state-of-the-art accuracy on the ChartQA and PlotQA datasets. In particular, our approach improves the accuracy of the previous state-of-the-art approach from 38% to 54% on the human-written questions in the ChartQA dataset, which needs strong reasoning. We hope our work underscores the potential of synthetic data and encourages further exploration of data augmentation using LLMs for reasoning-heavy tasks.
comment: Accepted to CVPR 2024
☆ Residual Dense Swin Transformer for Continuous Depth-Independent Ultrasound Imaging ICASSP2024
Ultrasound imaging is crucial for evaluating organ morphology and function, yet depth adjustment can degrade image quality and field-of-view, presenting a depth-dependent dilemma. Traditional interpolation-based zoom-in techniques often sacrifice detail and introduce artifacts. Motivated by the potential of arbitrary-scale super-resolution to naturally address these inherent challenges, we present the Residual Dense Swin Transformer Network (RDSTN), designed to capture the non-local characteristics and long-range dependencies intrinsic to ultrasound images. It comprises a linear embedding module for feature enhancement, an encoder with shifted-window attention for modeling non-locality, and an MLP decoder for continuous detail reconstruction. This strategy streamlines balancing image quality and field-of-view, which offers superior textures over traditional methods. Experimentally, RDSTN outperforms existing approaches while requiring fewer parameters. In conclusion, RDSTN shows promising potential for ultrasound image enhancement by overcoming the limitations of conventional interpolation-based methods and achieving depth-independent imaging.
comment: Accepted by ICASSP2024, https://ieeexplore.ieee.org/document/10447712
☆ FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models CVPR 2024
In recent years, there has been significant progress in the development of text-to-image generative models. Evaluating the quality of the generative models is one essential step in the development process. Unfortunately, the evaluation process could consume a significant amount of computational resources, making the required periodic evaluation of model performance (e.g., monitoring training progress) impractical. Therefore, we seek to improve the evaluation efficiency by selecting the representative subset of the text-image dataset. We systematically investigate the design choices, including the selection criteria (textural features or image-based metrics) and the selection granularity (prompt-level or set-level). We find that the insights from prior work on subset selection for training data do not generalize to this problem, and we propose FlashEval, an iterative search algorithm tailored to evaluation data selection. We demonstrate the effectiveness of FlashEval on ranking diffusion models with various configurations, including architectures, quantization levels, and sampler schedules on COCO and DiffusionDB datasets. Our searched 50-item subset could achieve comparable evaluation quality to the randomly sampled 500-item subset for COCO annotations on unseen models, achieving a 10x evaluation speedup. We release the condensed subset of these commonly used datasets to help facilitate diffusion algorithm design and evaluation, and open-source FlashEval as a tool for condensing future datasets, accessible at https://github.com/thu-nics/FlashEval.
comment: The paper is accepted by CVPR 2024
☆ Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion CVPR2024
360 depth estimation has recently received great attention for 3D reconstruction owing to its omnidirectional field of view (FoV). Recent approaches are predominantly focused on cross-projection fusion with geometry-based re-projection: they fuse 360 images with equirectangular projection (ERP) and another projection type, e.g., cubemap projection to estimate depth with the ERP format. However, these methods suffer from 1) limited local receptive fields, making it hardly possible to capture large FoV scenes, and 2) prohibitive computational cost, caused by the complex cross-projection fusion module design. In this paper, we propose Elite360D, a novel framework that inputs the ERP image and icosahedron projection (ICOSAP) point set, which is undistorted and spatially continuous. Elite360D is superior in its capacity in learning a representation from a local-with-global perspective. With a flexible ERP image encoder, it includes an ICOSAP point encoder, and a Bi-projection Bi-attention Fusion (B2F) module (totally ~1M parameters). Specifically, the ERP image encoder can take various perspective image-trained backbones (e.g., ResNet, Transformer) to extract local features. The point encoder extracts the global features from the ICOSAP. Then, the B2F module captures the semantic- and distance-aware dependencies between each pixel of the ERP feature and the entire ICOSAP feature set. Without specific backbone design and obvious computational cost increase, Elite360D outperforms the prior arts on several benchmark datasets.
comment: 8 pages, accepted by CVPR2024
☆ GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation CVPR 2024
This paper tackles a novel yet challenging problem: how to transfer knowledge from the emerging Segment Anything Model (SAM) -- which reveals impressive zero-shot instance segmentation capacity -- to learn a compact panoramic semantic segmentation model, i.e., student, without requiring any labeled data. This poses considerable challenges due to SAM's inability to provide semantic labels and the large capacity gap between SAM and the student. To this end, we propose a novel framework, called GoodSAM, that introduces a teacher assistant (TA) to provide semantic information, integrated with SAM to generate ensemble logits to achieve knowledge transfer. Specifically, we propose a Distortion-Aware Rectification (DAR) module that first addresses the distortion problem of panoramic images by imposing prediction-level consistency and boundary enhancement. This subtly enhances TA's prediction capacity on panoramic images. DAR then incorporates a cross-task complementary fusion block to adaptively merge the predictions of SAM and TA to obtain more reliable ensemble logits. Moreover, we introduce a Multi-level Knowledge Adaptation (MKA) module to efficiently transfer the multi-level feature knowledge from TA and ensemble logits to learn a compact student model. Extensive experiments on two benchmarks show that our GoodSAM achieves a remarkable +3.75\% mIoU improvement over the state-of-the-art (SOTA) domain adaptation methods. Also, our most lightweight model achieves comparable performance to the SOTA methods with only 3.7M parameters.
comment: Accepted to CVPR 2024
☆ Distilling Semantic Priors from SAM to Efficient Image Restoration Models
In image restoration (IR), leveraging semantic priors from segmentation models has been a common approach to improve performance. The recent segment anything model (SAM) has emerged as a powerful tool for extracting advanced semantic priors to enhance IR tasks. However, the computational cost of SAM is prohibitive for IR, compared to existing smaller IR models. The incorporation of SAM for extracting semantic priors considerably hampers the model inference efficiency. To address this issue, we propose a general framework to distill SAM's semantic knowledge to boost exiting IR models without interfering with their inference process. Specifically, our proposed framework consists of the semantic priors fusion (SPF) scheme and the semantic priors distillation (SPD) scheme. SPF fuses two kinds of information between the restored image predicted by the original IR model and the semantic mask predicted by SAM for the refined restored image. SPD leverages a self-distillation manner to distill the fused semantic priors to boost the performance of original IR models. Additionally, we design a semantic-guided relation (SGR) module for SPD, which ensures semantic feature representation space consistency to fully distill the priors. We demonstrate the effectiveness of our framework across multiple IR models and tasks, including deraining, deblurring, and denoising.
☆ Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion
Modern neural networks are often trained on massive datasets that are web scraped with minimal human inspection. As a result of this insecure curation pipeline, an adversary can poison or backdoor the resulting model by uploading malicious data to the internet and waiting for a victim to scrape and train on it. Existing approaches for creating poisons and backdoors start with randomly sampled clean data, called base samples, and then modify those samples to craft poisons. However, some base samples may be significantly more amenable to poisoning than others. As a result, we may be able to craft more potent poisons by carefully choosing the base samples. In this work, we use guided diffusion to synthesize base samples from scratch that lead to significantly more potent poisons and backdoors than previous state-of-the-art attacks. Our Guided Diffusion Poisoning (GDP) base samples can be combined with any downstream poisoning or backdoor attack to boost its effectiveness. Our implementation code is publicly available at: https://github.com/hsouri/GDP .
☆ RSTAR: Rotational Streak Artifact Reduction in 4D CBCT using Separable and Circular Convolutions
Four-dimensional cone-beam computed tomography (4D CBCT) provides respiration-resolved images and can be used for image-guided radiation therapy. However, the ability to reveal respiratory motion comes at the cost of image artifacts. As raw projection data are sorted into multiple respiratory phases, there is a limited number of cone-beam projections available for image reconstruction. Consequently, the 4D CBCT images are covered by severe streak artifacts. Although several deep learning-based methods have been proposed to address this issue, most algorithms employ ordinary network models, neglecting the intrinsic structural prior within 4D CBCT images. In this paper, we first explore the origin and appearance of streak artifacts in 4D CBCT images.Specifically, we find that streak artifacts exhibit a periodic rotational motion along with the patient's respiration. This unique motion pattern inspires us to distinguish the artifacts from the desired anatomical structures in the spatiotemporal domain. Thereafter, we propose a spatiotemporal neural network named RSTAR-Net with separable and circular convolutions for Rotational Streak Artifact Reduction. The specially designed model effectively encodes dynamic image features, facilitating the recovery of 4D CBCT images. Moreover, RSTAR-Net is also lightweight and computationally efficient. Extensive experiments substantiate the effectiveness of our proposed method, and RSTAR-Net shows superior performance to comparison methods.
☆ ChebMixer: Efficient Graph Representation Learning with MLP Mixer
Graph neural networks have achieved remarkable success in learning graph representations, especially graph Transformer, which has recently shown superior performance on various graph mining tasks. However, graph Transformer generally treats nodes as tokens, which results in quadratic complexity regarding the number of nodes during self-attention computation. The graph MLP Mixer addresses this challenge by using the efficient MLP Mixer technique from computer vision. However, the time-consuming process of extracting graph tokens limits its performance. In this paper, we present a novel architecture named ChebMixer, a newly graph MLP Mixer that uses fast Chebyshev polynomials-based spectral filtering to extract a sequence of tokens. Firstly, we produce multiscale representations of graph nodes via fast Chebyshev polynomial-based spectral filtering. Next, we consider each node's multiscale representations as a sequence of tokens and refine the node representation with an effective MLP Mixer. Finally, we aggregate the multiscale representations of nodes through Chebyshev interpolation. Owing to the powerful representation capabilities and fast computational properties of MLP Mixer, we can quickly extract more informative node representations to improve the performance of downstream tasks. The experimental results prove our significant improvements in a variety of scenarios ranging from graph node classification to medical image segmentation.
☆ 3D-EffiViTCaps: 3D Efficient Vision Transformer with Capsule for Medical Image Segmentation ICPR2024
Medical image segmentation (MIS) aims to finely segment various organs. It requires grasping global information from both parts and the entire image for better segmenting, and clinically there are often certain requirements for segmentation efficiency. Convolutional neural networks (CNNs) have made considerable achievements in MIS. However, they are difficult to fully collect global context information and their pooling layer may cause information loss. Capsule networks, which combine the benefits of CNNs while taking into account additional information such as relative location that CNNs do not, have lately demonstrated some advantages in MIS. Vision Transformer (ViT) employs transformers in visual tasks. Transformer based on attention mechanism has excellent global inductive modeling capabilities and is expected to capture longrange information. Moreover, there have been resent studies on making ViT more lightweight to minimize model complexity and increase efficiency. In this paper, we propose a U-shaped 3D encoder-decoder network named 3D-EffiViTCaps, which combines 3D capsule blocks with 3D EfficientViT blocks for MIS. Our encoder uses capsule blocks and EfficientViT blocks to jointly capture local and global semantic information more effectively and efficiently with less information loss, while the decoder employs CNN blocks and EfficientViT blocks to catch ffner details for segmentation. We conduct experiments on various datasets, including iSeg-2017, Hippocampus and Cardiac to verify the performance and efficiency of 3D-EffiViTCaps, which performs better than previous 3D CNN-based, 3D Capsule-based and 3D Transformer-based models. We further implement a series of ablation experiments on the main blocks. Our code is available at: https://github.com/HidNeuron/3D-EffiViTCaps.
comment: 15 pages, 4 figures, submitted to ICPR2024
☆ Impact of Video Compression Artifacts on Fisheye Camera Visual Perception Tasks
Autonomous driving systems require extensive data collection schemes to cover the diverse scenarios needed for building a robust and safe system. The data volumes are in the order of Exabytes and have to be stored for a long period of time (i.e., more than 10 years of the vehicle's life cycle). Lossless compression doesn't provide sufficient compression ratios, hence, lossy video compression has been explored. It is essential to prove that lossy video compression artifacts do not impact the performance of the perception algorithms. However, there is limited work in this area to provide a solid conclusion. In particular, there is no such work for fisheye cameras, which have high radial distortion and where compression may have higher artifacts. Fisheye cameras are commonly used in automotive systems for 3D object detection task. In this work, we provide the first analysis of the impact of standard video compression codecs on wide FOV fisheye camera images. We demonstrate that the achievable compression with negligible impact depends on the dataset and temporal prediction of the video codec. We propose a radial distortion-aware zonal metric to evaluate the performance of artifacts in fisheye images. In addition, we present a novel method for estimating affine mode parameters of the latest VVC codec, and suggest some areas for improvement in video codecs for the application to fisheye imagery.
☆ MEDDAP: Medical Dataset Enhancement via Diversified Augmentation Pipeline MICCAI-2024
The effectiveness of Deep Neural Networks (DNNs) heavily relies on the abundance and accuracy of available training data. However, collecting and annotating data on a large scale is often both costly and time-intensive, particularly in medical cases where practitioners are already occupied with their duties. Moreover, ensuring that the model remains robust across various scenarios of image capture is crucial in medical domains, especially when dealing with ultrasound images that vary based on the settings of different devices and the manual operation of the transducer. To address this challenge, we introduce a novel pipeline called MEDDAP, which leverages Stable Diffusion (SD) models to augment existing small datasets by automatically generating new informative labeled samples. Pretrained checkpoints for SD are typically based on natural images, and training them for medical images requires significant GPU resources due to their heavy parameters. To overcome this challenge, we introduce USLoRA (Ultrasound Low-Rank Adaptation), a novel fine-tuning method tailored specifically for ultrasound applications. USLoRA allows for selective fine-tuning of weights within SD, requiring fewer than 0.1\% of parameters compared to fully fine-tuning only the UNet portion of SD. To enhance dataset diversity, we incorporate different adjectives into the generation process prompts, thereby desensitizing the classifiers to intensity changes across different images. This approach is inspired by clinicians' decision-making processes regarding breast tumors, where tumor shape often plays a more crucial role than intensity. In conclusion, our pipeline not only outperforms classifiers trained on the original dataset but also demonstrates superior performance when encountering unseen datasets. The source code is available at https://github.com/yasamin-med/MEDDAP.
comment: submitted to miccai 2024 submitted to miccai 2024 Submitted to MICCAI-2024
♻ ☆ BioNeRF: Biologically Plausible Neural Radiance Fields for View Synthesis
This paper presents BioNeRF, a biologically plausible architecture that models scenes in a 3D representation and synthesizes new views through radiance fields. Since NeRF relies on the network weights to store the scene's 3-dimensional representation, BioNeRF implements a cognitive-inspired mechanism that fuses inputs from multiple sources into a memory-like structure, improving the storing capacity and extracting more intrinsic and correlated information. BioNeRF also mimics a behavior observed in pyramidal cells concerning contextual information, in which the memory is provided as the context and combined with the inputs of two subsequent neural models, one responsible for producing the volumetric densities and the other the colors used to render the scene. Experimental results show that BioNeRF outperforms state-of-the-art results concerning a quality measure that encodes human perception in two datasets: real-world images and synthetic data.
♻ ☆ Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception CVPR 2024
Multimodal Large Language Model (MLLMs) leverages Large Language Models as a cognitive framework for diverse visual-language tasks. Recent efforts have been made to equip MLLMs with visual perceiving and grounding capabilities. However, there still remains a gap in providing fine-grained pixel-level perceptions and extending interactions beyond text-specific inputs. In this work, we propose {\bf{AnyRef}}, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references, such as texts, boxes, images, or audio. This innovation empowers users with greater flexibility to engage with the model beyond textual and regional prompts, without modality-specific designs. Through our proposed refocusing mechanism, the generated grounding output is guided to better focus on the referenced object, implicitly incorporating additional pixel-level supervision. This simple modification utilizes attention scores generated during the inference of LLM, eliminating the need for extra computations while exhibiting performance enhancements in both grounding masks and referring expressions. With only publicly available training data, our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
comment: CVPR 2024
♻ ☆ Word4Per: Zero-shot Composed Person Retrieval
Searching for specific person has great social benefits and security value, and it often involves a combination of visual and textual information. Conventional person retrieval methods, whether image-based or text-based, usually fall short in effectively harnessing both types of information, leading to the loss of accuracy. In this paper, a whole new task called Composed Person Retrieval (CPR) is proposed to jointly utilize both image and text information for target person retrieval. However, the supervised CPR requires very costly manual annotation dataset, while there are currently no available resources. To mitigate this issue, we firstly introduce the Zero-shot Composed Person Retrieval (ZS-CPR), which leverages existing domain-related data to resolve the CPR problem without expensive annotations. Secondly, to learn ZS-CPR model, we propose a two-stage learning framework, Word4Per, where a lightweight Textual Inversion Network (TINet) and a text-based person retrieval model based on fine-tuned Contrastive Language-Image Pre-training (CLIP) network are learned without utilizing any CPR data. Thirdly, a finely annotated Image-Text Composed Person Retrieval (ITCPR) dataset is built as the benchmark to assess the performance of the proposed Word4Per framework. Extensive experiments under both Rank-1 and mAP demonstrate the effectiveness of Word4Per for the ZS-CPR task, surpassing the comparative methods by over 10\%. The code and ITCPR dataset will be publicly available at https://github.com/Delong-liu-bupt/Word4Per.
♻ ☆ Knowledge Distillation for Road Detection based on cross-model Semi-Supervised Learning
The advancement of knowledge distillation has played a crucial role in enabling the transfer of knowledge from larger teacher models to smaller and more efficient student models, and is particularly beneficial for online and resource-constrained applications. The effectiveness of the student model heavily relies on the quality of the distilled knowledge received from the teacher. Given the accessibility of unlabelled remote sensing data, semi-supervised learning has become a prevalent strategy for enhancing model performance. However, relying solely on semi-supervised learning with smaller models may be insufficient due to their limited capacity for feature extraction. This limitation restricts their ability to exploit training data. To address this issue, we propose an integrated approach that combines knowledge distillation and semi-supervised learning methods. This hybrid approach leverages the robust capabilities of large models to effectively utilise large unlabelled data whilst subsequently providing the small student model with rich and informative features for enhancement. The proposed semi-supervised learning-based knowledge distillation (SSLKD) approach demonstrates a notable improvement in the performance of the student model, in the application of road segmentation, surpassing the effectiveness of traditional semi-supervised learning methods.
♻ ☆ HiFi-123: Towards High-fidelity One Image to 3D Content Generation
Recent advances in diffusion models have enabled 3D generation from a single image. However, current methods often produce suboptimal results for novel views, with blurred textures and deviations from the reference image, limiting their practical applications. In this paper, we introduce HiFi-123, a method designed for high-fidelity and multi-view consistent 3D generation. Our contributions are twofold: First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods. Second, capitalizing on the RGNV, we present a novel Reference-Guided State Distillation (RGSD) loss. When incorporated into the optimization-based image-to-3D pipeline, our method significantly improves 3D generation quality, achieving state-of-the-art performance. Comprehensive evaluations demonstrate the effectiveness of our approach over existing methods, both qualitatively and quantitatively. Video results are available on the project page.
comment: Project Page: https://drexubery.github.io/HiFi-123/
♻ ☆ SVGDreamer: Text Guided SVG Generation with Diffusion Model CVPR 2024
Recently, text-guided scalable vector graphics (SVGs) synthesis has shown promise in domains such as iconography and sketch. However, existing text-to-SVG generation methods lack editability and struggle with visual quality and result diversity. To address these limitations, we propose a novel text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer incorporates a semantic-driven image vectorization (SIVE) process that enables the decomposition of synthesis into foreground objects and background, thereby enhancing editability. Specifically, the SIVE process introduce attention-based primitive control and an attention-mask loss function for effective control and manipulation of individual elements. Additionally, we propose a Vectorized Particle-based Score Distillation (VPSD) approach to tackle the challenges of shape over-smoothing, color over-saturation, limited diversity in results, and slow convergence in existing text-to-SVG generation methods. VPSD models SVGs as distributions of control points and colors to counteract over-smoothing and over-saturation. Furthermore, VPSD leverages a reward model to reweight vector particles, which improves aesthetic appeal and accelerates convergence. Extensive experiments have been conducted to validate the effectiveness of SVGDreamer, demonstrating its superiority over baseline methods in terms of editability, visual quality, and diversity. The code and demo of SVGDreamer can be found at https://ximinng.github.io/SVGDreamer-project/
comment: Accepted by CVPR 2024. project link: https://ximinng.github.io/SVGDreamer-project/
♻ ☆ Variational Bayes image restoration with compressive autoencoders
Regularization of inverse problems is of paramount importance in computational imaging. The ability of neural networks to learn efficient image representations has been recently exploited to design powerful data-driven regularizers. While state-of-the-art plug-and-play methods rely on an implicit regularization provided by neural denoisers, alternative Bayesian approaches consider Maximum A Posteriori (MAP) estimation in the latent space of a generative model, thus with an explicit regularization. However, state-of-the-art deep generative models require a huge amount of training data compared to denoisers. Besides, their complexity hampers the optimization involved in latent MAP derivation. In this work, we first propose to use compressive autoencoders instead. These networks, which can be seen as variational autoencoders with a flexible latent prior, are smaller and easier to train than state-of-the-art generative models. As a second contribution, we introduce the Variational Bayes Latent Estimation (VBLE) algorithm, which performs latent estimation within the framework of variational inference. Thanks to a simple yet efficient parameterization of the variational posterior, VBLE allows for fast and easy (approximate) posterior sampling. Experimental results on image datasets BSD and FFHQ demonstrate that VBLE reaches similar performance than state-of-the-art plug-and-play methods, while being able to quantify uncertainties faster than other existing posterior sampling techniques.
♻ ☆ Mask Grounding for Referring Image Segmentation CVPR2024
Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features, by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be directly used on prior RIS methods and consistently bring improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. With all these techniques, our comprehensive approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior arts on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.
comment: Accepted by CVPR2024; Project page: https://yxchng.github.io/projects/mask-grounding
♻ ☆ Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing
Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.
♻ ☆ LightIt: Illumination Modeling and Control for Diffusion Models
We introduce LightIt, a method for explicit illumination control for image generation. Recent generative methods lack lighting control, which is crucial to numerous artistic aspects of image generation such as setting the overall mood or cinematic appearance. To overcome these limitations, we propose to condition the generation on shading and normal maps. We model the lighting with single bounce shading, which includes cast shadows. We first train a shading estimation module to generate a dataset of real-world images and shading pairs. Then, we train a control network using the estimated shading and normals as input. Our method demonstrates high-quality image generation and lighting control in numerous scenes. Additionally, we use our generated dataset to train an identity-preserving relighting model, conditioned on an image and a target shading. Our method is the first that enables the generation of images with controllable, consistent lighting and performs on par with specialized relighting state-of-the-art methods.
comment: Project page: https://peter-kocsis.github.io/LightIt/ Video: https://youtu.be/cCfSBD5aPLI
♻ ☆ Fully automated workflow for the design of patient-specific orthopaedic implants: application to total knee arthroplasty
Arthroplasty is commonly performed to treat joint osteoarthritis, reducing pain and improving mobility. While arthroplasty has known several technical improvements, a significant share of patients are still unsatisfied with their surgery. Personalised arthroplasty improves surgical outcomes however current solutions require delays, making it difficult to integrate in clinical routine. We propose a fully automated workflow to design patient-specific implants, presented for total knee arthroplasty, the most widely performed arthroplasty in the world nowadays. The proposed pipeline first uses artificial neural networks to segment the proximal and distal extremities of the femur and tibia. Then the full bones are reconstructed using augmented statistical shape models, combining shape and landmarks information. Finally, 77 morphological parameters are computed to design patient-specific implants. The developed workflow has been trained using 91 CT scans of lower limb and evaluated on 41 CT scans manually segmented, in terms of accuracy and execution time. The workflow accuracy was $0.4\pm0.2mm$ for the segmentation, $1.2\pm0.4mm$ for the full bones reconstruction, and $2.8\pm2.2mm$ for the anatomical landmarks determination. The custom implants fitted the patients' anatomy with $0.6\pm0.2mm$ accuracy. The whole process from segmentation to implants' design lasted about 5 minutes. The proposed workflow allows for a fast and reliable personalisation of knee implants, directly from the patient CT image without requiring any manual intervention. It establishes a patient-specific pre-operative planning for TKA in a very short time making it easily available for all patients. Combined with efficient implant manufacturing techniques, this solution could help answer the growing number of arthroplasties while reducing complications and improving the patients' satisfaction.
♻ ☆ denoiSplit: a method for joint image splitting and unsupervised denoising
In this work we present denoiSplit, a method to tackle a new analysis task, i.e. the challenge of joint semantic image splitting and unsupervised denoising. This dual approach has important applications in fluorescence microscopy, where semantic image splitting has important applications but noise does generally hinder the downstream analysis of image content. Image splitting involves dissecting an image into its distinguishable semantic structures. We show that the current state-of-the-art method for this task struggles in the presence of image noise, inadvertently also distributing the noise across the predicted outputs. The method we present here can deal with image noise by integrating an unsupervised denoising sub-task. This integration results in improved semantic image unmixing, even in the presence of notable and realistic levels of imaging noise. A key innovation in denoiSplit is the use of specifically formulated noise models and the suitable adjustment of KL-divergence loss for the high-dimensional hierarchical latent space we are training. We showcase the performance of denoiSplit across 4 tasks on real-world microscopy images. Additionally, we perform qualitative and quantitative evaluations and compare results to existing benchmarks, demonstrating the effectiveness of using denoiSplit: a single Variational Splitting Encoder-Decoder (VSE) Network using two suitable noise models to jointly perform semantic splitting and denoising.
♻ ☆ Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment between sound and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset with high-quality pixel-level multi-class annotated images associated with audio files, and 2) a model that can establish strong links between audio information and its corresponding visual object. However, these requirements are only partially addressed by current methods, with training sets containing biased audio-visual data, and models that generalise poorly beyond this biased training set. In this work, we propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning to leverage discriminative contrastive samples to enforce cross-modal understanding. We show empirical results that demonstrate the effectiveness of our benchmark. Furthermore, experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
comment: Code is available at https://github.com/cyh-0/CAVP
♻ ☆ FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks
We propose FocusCLIP, integrating subject-level guidance--a specialized mechanism for target-specific supervision--into the CLIP framework for improved zero-shot transfer on human-centric tasks. Our novel contributions enhance CLIP on both the vision and text sides. On the vision side, we incorporate ROI heatmaps emulating human visual attention mechanisms to emphasize subject-relevant image regions. On the text side, we introduce human pose descriptions to provide rich contextual information. For human-centric tasks, FocusCLIP is trained with images from the MPII Human Pose dataset. The proposed approach surpassed CLIP by an average of 8.61% across five previously unseen datasets covering three human-centric tasks. FocusCLIP achieved an average accuracy of 33.65% compared to 25.04% by CLIP. We observed a 3.98% improvement in activity recognition, a 14.78% improvement in age classification, and a 7.06% improvement in emotion recognition. Moreover, using our proposed single-shot LLM prompting strategy, we release a high-quality MPII Pose Descriptions dataset to encourage further research in multimodal learning for human-centric tasks. Furthermore, we also demonstrate the effectiveness of our subject-level supervision on non-human-centric tasks. FocusCLIP shows a 2.47% improvement over CLIP in zero-shot bird classification using the CUB dataset. Our findings emphasize the potential of integrating subject-level guidance with general pretraining methods for enhanced downstream performance.
♻ ☆ Unleashing the Power of Self-Supervised Image Denoising: A Comprehensive Review
The advent of deep learning has brought a revolutionary transformation to image denoising techniques. However, the persistent challenge of acquiring noise-clean pairs for supervised methods in real-world scenarios remains formidable, necessitating the exploration of more practical self-supervised image denoising. This paper focuses on self-supervised image denoising methods that offer effective solutions to address this challenge. Our comprehensive review thoroughly analyzes the latest advancements in self-supervised image denoising approaches, categorizing them into three distinct classes: General methods, Blind Spot Network (BSN)-based methods, and Transformer-based methods. For each class, we provide a concise theoretical analysis along with their practical applications. To assess the effectiveness of these methods, we present both quantitative and qualitative experimental results on various datasets, utilizing classical algorithms as benchmarks. Additionally, we critically discuss the current limitations of these methods and propose promising directions for future research. By offering a detailed overview of recent developments in self-supervised image denoising, this review serves as an invaluable resource for researchers and practitioners in the field, facilitating a deeper understanding of this emerging domain and inspiring further advancements.
comment: 24 pages
♻ ☆ BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image CVPR 2024
Creating personalized hand avatars is important to offer a realistic experience to users on AR / VR platforms. While most prior studies focused on reconstructing 3D hand shapes, some recent work has tackled the reconstruction of hand textures on top of shapes. However, these methods are often limited to capturing pixels on the visible side of a hand, requiring diverse views of the hand in a video or multiple images as input. In this paper, we propose a novel method, BiTT(Bi-directional Texture reconstruction of Two hands), which is the first end-to-end trainable method for relightable, pose-free texture reconstruction of two interacting hands taking only a single RGB image, by three novel components: 1) bi-directional (left $\leftrightarrow$ right) texture reconstruction using the texture symmetry of left / right hands, 2) utilizing a texture parametric model for hand texture recovery, and 3) the overall coarse-to-fine stage pipeline for reconstructing personalized texture of two interacting hands. BiTT first estimates the scene light condition and albedo image from an input image, then reconstructs the texture of both hands through the texture parametric model and bi-directional texture reconstructor. In experiments using InterHand2.6M and RGB2Hands datasets, our method significantly outperforms state-of-the-art hand texture reconstruction methods quantitatively and qualitatively. The code is available at https://github.com/yunminjin2/BiTT
comment: Accepted by CVPR 2024, Project Page: https://yunminjin2.github.io/projects/bitt/
♻ ☆ Toulouse Hyperspectral Data Set: a benchmark data set to assess semi-supervised spectral representation learning and pixel-wise classification techniques
Airborne hyperspectral images can be used to map the land cover in large urban areas, thanks to their very high spatial and spectral resolutions on a wide spectral domain. While the spectral dimension of hyperspectral images is highly informative of the chemical composition of the land surface, the use of state-of-the-art machine learning algorithms to map the land cover has been dramatically limited by the availability of training data. To cope with the scarcity of annotations, semi-supervised and self-supervised techniques have lately raised a lot of interest in the community. Yet, the publicly available hyperspectral data sets commonly used to benchmark machine learning models are not totally suited to evaluate their generalization performances due to one or several of the following properties: a limited geographical coverage (which does not reflect the spectral diversity in metropolitan areas), a small number of land cover classes and a lack of appropriate standard train / test splits for semi-supervised and self-supervised learning. Therefore, we release in this paper the Toulouse Hyperspectral Data Set that stands out from other data sets in the above-mentioned respects in order to meet key issues in spectral representation learning and classification over large-scale hyperspectral images with very few labeled pixels. Besides, we discuss and experiment self-supervised techniques for spectral representation learning, including the Masked Autoencoder, and establish a baseline for pixel-wise classification achieving 85% overall accuracy and 77% F1 score. The Toulouse Hyperspectral Data Set and our code are publicly available at https://www.toulouse-hyperspectral-data-set.com and https://www.github.com/Romain3Ch216/tlse-experiments, respectively.
comment: 17 pages, 13 figures
♻ ☆ Geometric Prior Based Deep Human Point Cloud Geometry Compression
The emergence of digital avatars has raised an exponential increase in the demand for human point clouds with realistic and intricate details. The compression of such data becomes challenging with overwhelming data amounts comprising millions of points. Herein, we leverage the human geometric prior in geometry redundancy removal of point clouds, greatly promoting the compression performance. More specifically, the prior provides topological constraints as geometry initialization, allowing adaptive adjustments with a compact parameter set that could be represented with only a few bits. Therefore, we can envisage high-resolution human point clouds as a combination of geometric priors and structural deviations. The priors could first be derived with an aligned point cloud, and subsequently the difference of features is compressed into a compact latent code. The proposed framework can operate in a play-and-plug fashion with existing learning based point cloud compression methods. Extensive experimental results show that our approach significantly improves the compression performance without deteriorating the quality, demonstrating its promise in a variety of applications.
comment: Accepted by TCSVT 2024
♻ ☆ Explaining CLIP's performance disparities on data from blind/low vision users CVPR
Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet, these models have not been systematically evaluated on data captured by BLV users. We address this by empirically assessing CLIP, a widely-used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than web-crawled images. This disparity stems from CLIP's sensitivities to 1) image content (e.g. not recognizing disability objects as well as other objects); 2) image quality (e.g. not being robust to lighting variation); and 3) text content (e.g. not recognizing objects described by tactile adjectives as well as visual ones). We delve deeper with a textual analysis of three common pre-training datasets: LAION-400M, LAION-2B and DataComp-1B, showing that disability content is rarely mentioned. We then provide three examples that illustrate how the performance disparities extend to three downstream models underpinned by CLIP: OWL-ViT, CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios, which we discuss alongside a set of other possible mitigations.
comment: Accepted at 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
♻ ☆ Distributionally Generative Augmentation for Fair Facial Attribute Classification CVPR 2024
Facial Attribute Classification (FAC) holds substantial promise in widespread applications. However, FAC models trained by traditional methodologies can be unfair by exhibiting accuracy inconsistencies across varied data subpopulations. This unfairness is largely attributed to bias in data, where some spurious attributes (e.g., Male) statistically correlate with the target attribute (e.g., Smiling). Most of existing fairness-aware methods rely on the labels of spurious attributes, which may be unavailable in practice. This work proposes a novel, generation-based two-stage framework to train a fair FAC model on biased data without additional annotation. Initially, we identify the potential spurious attributes based on generative models. Notably, it enhances interpretability by explicitly showing the spurious attributes in image space. Following this, for each image, we first edit the spurious attributes with a random degree sampled from a uniform distribution, while keeping target attribute unchanged. Then we train a fair FAC model by fostering model invariance to these augmentation. Extensive experiments on three common datasets demonstrate the effectiveness of our method in promoting fairness in FAC without compromising accuracy. Codes are in https://github.com/heqianpei/DiGA.
comment: CVPR 2024
♻ ☆ Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment
No-reference point cloud quality assessment (NR-PCQA) aims to automatically evaluate the perceptual quality of distorted point clouds without available reference, which have achieved tremendous improvements due to the utilization of deep neural networks. However, learning-based NR-PCQA methods suffer from the scarcity of labeled data and usually perform suboptimally in terms of generalization. To solve the problem, we propose a novel contrastive pre-training framework tailored for PCQA (CoPA), which enables the pre-trained model to learn quality-aware representations from unlabeled data. To obtain anchors in the representation space, we project point clouds with different distortions into images and randomly mix their local patches to form mixed images with multiple distortions. Utilizing the generated anchors, we constrain the pre-training process via a quality-aware contrastive loss following the philosophy that perceptual quality is closely related to both content and distortion. Furthermore, in the model fine-tuning stage, we propose a semantic-guided multi-view fusion module to effectively integrate the features of projected images from multiple perspectives. Extensive experiments show that our method outperforms the state-of-the-art PCQA methods on popular benchmarks. Further investigations demonstrate that CoPA can also benefit existing learning-based PCQA models.
♻ ☆ Differentiable Point-based Inverse Rendering
We present differentiable point-based inverse rendering, DPIR, an analysis-by-synthesis method that processes images captured under diverse illuminations to estimate shape and spatially-varying BRDF. To this end, we adopt point-based rendering, eliminating the need for multiple samplings per ray, typical of volumetric rendering, thus significantly enhancing the speed of inverse rendering. To realize this idea, we devise a hybrid point-volumetric representation for geometry and a regularized basis-BRDF representation for reflectance. The hybrid geometric representation enables fast rendering through point-based splatting while retaining the geometric details and stability inherent to SDF-based representations. The regularized basis-BRDF mitigates the ill-posedness of inverse rendering stemming from limited light-view angular samples. We also propose an efficient shadow detection method using point-based shadow map rendering. Our extensive evaluations demonstrate that DPIR outperforms prior works in terms of reconstruction accuracy, computational efficiency, and memory footprint. Furthermore, our explicit point-based representation and rendering enables intuitive geometry and reflectance editing.
♻ ☆ HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models CVPR 2024
We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.
comment: Accepted to CVPR 2024
♻ ☆ Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models
This study discusses the critical issues of Virtual Try-On in contemporary e-commerce and the prospective metaverse, emphasizing the challenges of preserving intricate texture details and distinctive features of the target person and the clothes in various scenarios, such as clothing texture and identity characteristics like tattoos or accessories. In addition to the fidelity of the synthesized images, the efficiency of the synthesis process presents a significant hurdle. Various existing approaches are explored, highlighting the limitations and unresolved aspects, e.g., identity information omission, uncontrollable artifacts, and low synthesis speed. It then proposes a novel diffusion-based solution that addresses garment texture preservation and user identity retention during virtual try-on. The proposed network comprises two primary modules - a warping module aligning clothing with individual features and a try-on module refining the attire and generating missing parts integrated with a mask-aware post-processing technique ensuring the integrity of the individual's identity. It demonstrates impressive results, surpassing the state-of-the-art in speed by nearly 20 times during inference, with superior fidelity in qualitative assessments. Quantitative evaluations confirm comparable performance with the recent SOTA method on the VITON-HD and Dresscode datasets.
♻ ☆ Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models
Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the training and inference phases, restricting their use to a limited audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among various aspects: visual representation, language models, and optimization strategies. We show that without increasing the volume of training data, our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs. Our code is available at https://github.com/zhuyiche/llava-phi.
♻ ☆ Dispersed Structured Light for Hyperspectral 3D Imaging
Hyperspectral 3D imaging aims to acquire both depth and spectral information of a scene. However, existing methods are either prohibitively expensive and bulky or compromise on spectral and depth accuracy. In this work, we present Dispersed Structured Light (DSL), a cost-effective and compact method for accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera system by placing a sub-millimeter thick diffraction grating film front of the projector. The grating disperses structured light based on light wavelength. To utilize the dispersed structured light, we devise a model for dispersive projection image formation and a per-pixel hyperspectral 3D reconstruction method. We validate DSL by instantiating a compact experimental prototype. DSL achieves spectral accuracy of 18.8nm full-width half-maximum (FWHM) and depth error of 1mm. We demonstrate that DSL outperforms prior work on practical hyperspectral 3D imaging. DSL promises accurate and practical hyperspectral 3D imaging for diverse application domains, including computer vision and graphics, cultural heritage, geology, and biology.
♻ ☆ PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation, empowering non-experts to generate stunning images with unique styles. While promising, adding realistic motions into these personalized images by text poses significant challenges in preserving distinct styles, high-fidelity details, and achieving motion controllability by text. In this paper, we present PIA, a Personalized Image Animator that excels in aligning with condition images, achieving motion controllability by text, and the compatibility with various personalized T2I models without specific tuning. To achieve these goals, PIA builds upon a base T2I model with well-trained temporal alignment layers, allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the introduction of the condition module, which utilizes the condition frame and inter-frame affinity as input to transfer appearance information guided by the affinity hint for individual frame synthesis in the latent space. This design mitigates the challenges of appearance-related image alignment within and allows for a stronger focus on aligning with motion-related guidance.
comment: Project page: https://pi-animator.github.io/
♻ ☆ I-PHYRE: Interactive Physical Reasoning ICLR 2024
Current evaluation protocols predominantly assess physical reasoning in stationary scenes, creating a gap in evaluating agents' abilities to interact with dynamic events. While contemporary methods allow agents to modify initial scene configurations and observe consequences, they lack the capability to interact with events in real time. To address this, we introduce I-PHYRE, a framework that challenges agents to simultaneously exhibit intuitive physical reasoning, multi-step planning, and in-situ intervention. Here, intuitive physical reasoning refers to a quick, approximate understanding of physics to address complex problems; multi-step denotes the need for extensive sequence planning in I-PHYRE, considering each intervention can significantly alter subsequent choices; and in-situ implies the necessity for timely object manipulation within a scene, where minor timing deviations can result in task failure. We formulate four game splits to scrutinize agents' learning and generalization of essential principles of interactive physical reasoning, fostering learning through interaction with representative scenarios. Our exploration involves three planning strategies and examines several supervised and reinforcement agents' zero-shot generalization proficiency on I-PHYRE. The outcomes highlight a notable gap between existing learning algorithms and human performance, emphasizing the imperative for more research in enhancing agents with interactive physical reasoning capabilities. The environment and baselines will be made publicly available.
comment: 21 pages, ICLR 2024
♻ ☆ Solving the bongard-logo problem by modeling a probabilistic model
Abstract reasoning problems challenge the perceptual and cognitive abilities of AI algorithms, demanding deeper pattern discernment and inductive reasoning beyond explicit image features. This study introduces PMoC, a tailored probability model for the Bongard-Logo problem, achieving high reasoning accuracy by constructing independent probability models. Additionally, we present Pose-Transformer, an enhanced Transformer-Encoder designed for complex abstract reasoning tasks, including Bongard-Logo, RAVEN, I-RAVEN, and PGM. Pose-Transformer incorporates positional information learning, inspired by capsule networks' pose matrices, enhancing its focus on local positional relationships in image data processing. When integrated with PMoC, it further improves reasoning accuracy. Our approach effectively addresses reasoning difficulties associated with abstract entities' positional changes, outperforming previous models on the OIG, D3$\times$3 subsets of RAVEN, and PGM databases. This research contributes to advancing AI's capabilities in abstract reasoning and cognitive pattern recognition.
comment: 14 pages, 11 figures, 3 tables
♻ ☆ Triple-CFN: Restructuring Conceptual Spaces for Enhancing Abstract Reasoning process
Abstract reasoning problems pose significant challenges to artificial intelligence algorithms, demanding cognitive capabilities beyond those required for perception tasks. This study introduces the Triple-CFN approach to tackle the Bongard-Logo problem, achieving notable reasoning accuracy by implicitly reorganizing the concept space of conflicting instances. Additionally, the Triple-CFN paradigm proves effective for the RPM problem with necessary modifications, yielding competitive results. To further enhance performance on the RPM issue, we develop the Meta Triple-CFN network, which explicitly structures the problem space while maintaining interpretability on progressive patterns. The success of Meta Triple-CFN is attributed to its paradigm of modeling the conceptual space, equivalent to normalizing reasoning information. Based on this ideology, we introduce the Re-space layer, enhancing the performance of both Meta Triple-CFN and Triple-CFN. This paper aims to contribute to advancements in machine intelligence by exploring innovative network designs for addressing abstract reasoning problems, paving the way for further breakthroughs in this domain.
comment: 14 pages, 14 figures, 5 tables
♻ ☆ D4C glove-train: solving the RPM and Bongard-logo problem by distributing and Circumscribing concepts
This paper achieves noteworthy progress in the realm of abstract reasoning, particularly in addressing Raven's Progressive Matrices (RPM) and Bongard-Logo challenges. Initially, we introduce Lico-Net, a novel baseline model that resolves RPM problems with remarkable accuracy. Leveraging this foundation, we advance with the D3C approach, which advocates representing the underlying concepts in abstract reasoning problems through distributions. This perspective enhances the performance of both Lico-Net and a baseline model excelling in Bongard-Logo tasks. To bolster the computational efficiency of D3C, we present the D3C-cos variant, offering a streamlined yet precise solution. Furthermore, we propose the D2C method, redefining conceptual boundaries within these domains and bridging the divide between high-level abstractions and their lower-dimensional counterparts. Finally, we extend our methodology to D4C, employing adversarial techniques to refine conceptual boundaries further and demonstrate substantial improvements in both RPM and Bongard-Logo challenges. Overall, our contributions present a fresh outlook and practical advancements in the field of abstract reasoning.
comment: 18 pages, 19 figures, 6 tables
♻ ☆ CiPR: An Efficient Framework with Cross-instance Positive Relations for Generalized Category Discovery
We tackle the issue of generalized category discovery (GCD). GCD considers the open-world problem of automatically clustering a partially labelled dataset, in which the unlabelled data may contain instances from both novel categories and labelled classes. In this paper, we address the GCD problem with an unknown category number for the unlabelled data. We propose a framework, named CiPR, to bootstrap the representation by exploiting Cross-instance Positive Relations in the partially labelled data for contrastive learning, which have been neglected in existing methods. To obtain reliable cross-instance relations to facilitate representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which can produce a clustering hierarchy directly from the connected components of a graph constructed from selective neighbors. We further present a method to estimate the unknown class number using SNC with a joint reference score that considers clustering indexes of both labelled and unlabelled data, and extend SNC to allow label assignment for the unlabelled instances with a given class number. We thoroughly evaluate our framework on public generic image recognition datasets and challenging fine-grained datasets, and establish a new state-of-the-art. Code: https://github.com/haoosz/CiPR
comment: Accepted to TMLR. Code: https://github.com/haoosz/CiPR
♻ ☆ HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data CVPR 2024
Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks. However, the hallucinations inherent in machine-generated data, which could lead to hallucinatory outputs in MLLMs, remain under-explored. This work aims to investigate various hallucinations (i.e., object, relation, attribute hallucinations) and mitigate those hallucinatory toxicities in large-scale machine-generated visual instruction datasets. Drawing on the human ability to identify factual errors, we present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm. We use our framework to identify and eliminate hallucinations in the training data automatically. Interestingly, HalluciDoctor also indicates that spurious correlations arising from long-tail object co-occurrences contribute to hallucinations. Based on that, we execute counterfactual visual instruction expansion to balance data distribution, thereby enhancing MLLMs' resistance to hallucinations. Comprehensive experiments on hallucination evaluation benchmarks show that our method successfully mitigates 44.6% hallucinations relatively and maintains competitive performance compared to LLaVA. The data and code for this paper are publicly available. \url{https://github.com/Yuqifan1117/HalluciDoctor}.
comment: Accepted by CVPR 2024
♻ ☆ W-HMR: Human Mesh Recovery in World Space with Weak-supervised Camera Calibration and Orientation Correction
For a long time, in reconstructing 3D human bodies from monocular images, most methods opted to simplify the task by minimizing the influence of the camera. Using a coarse focal length setting results in the reconstructed bodies not aligning well with distorted images. Ignoring camera rotation leads to an unrealistic reconstructed body pose in world space. Consequently, the application scenarios of existing methods are confined to controlled environments. When confronted with complex and diverse in-the-wild images, they struggle to achieve accurate and reasonable reconstruction in world space. To address the above issues, we propose W-HMR, which decouples global body recovery into camera calibration, local body recovery, and global body orientation correction. We design the first weak-supervised camera calibration method for body distortion, eliminating dependence on focal length labels and achieving finer mesh-image alignment. We propose a novel orientation correction module to allow the reconstructed human body to remain normal in world space. Decoupling body orientation and body pose enables our model to consider the accuracy in camera coordinate and the reasonableness in world coordinate simultaneously, expanding the range of applications. As a result, W-HMR achieves high-quality reconstruction in dual coordinate systems, particularly in challenging scenes. Codes and demos have been released on the project page https://yw0208.github.io/w-hmr/.
comment: Project Page: https://yw0208.github.io/w-hmr/
♻ ☆ When Semantic Segmentation Meets Frequency Aliasing ICLR 2024
Despite recent advancements in semantic segmentation, where and what pixels are hard to segment remains largely unexplored. Existing research only separates an image into easy and hard regions and empirically observes the latter are associated with object boundaries. In this paper, we conduct a comprehensive analysis of hard pixel errors, categorizing them into three types: false responses, merging mistakes, and displacements. Our findings reveal a quantitative association between hard pixels and aliasing, which is distortion caused by the overlapping of frequency components in the Fourier domain during downsampling. To identify the frequencies responsible for aliasing, we propose using the equivalent sampling rate to calculate the Nyquist frequency, which marks the threshold for aliasing. Then, we introduce the aliasing score as a metric to quantify the extent of aliasing. While positively correlated with the proposed aliasing score, three types of hard pixels exhibit different patterns. Here, we propose two novel de-aliasing filter (DAF) and frequency mixing (FreqMix) modules to alleviate aliasing degradation by accurately removing or adjusting frequencies higher than the Nyquist frequency. The DAF precisely removes the frequencies responsible for aliasing before downsampling, while the FreqMix dynamically selects high-frequency components within the encoder block. Experimental results demonstrate consistent improvements in semantic segmentation and low-light instance segmentation tasks. The code is available at: https://github.com/Linwei-Chen/Seg-Aliasing.
comment: Accepted by ICLR 2024
♻ ☆ Cell Variational Information Bottleneck Network
In this work, we propose Cell Variational Information Bottleneck Network (cellVIB), a convolutional neural network using information bottleneck mechanism, which can be combined with the latest feedforward network architecture in an end-to-end training method. Our Cell Variational Information Bottleneck Network is constructed by stacking VIB cells, which generate feature maps with uncertainty. As layers going deeper, the regularization effect will gradually increase, instead of directly adding excessive regular constraints to the output layer of the model as in Deep VIB. Under each VIB cell, the feedforward process learns an independent mean term and an standard deviation term, and predicts the Gaussian distribution based on them. The feedback process is based on reparameterization trick for effective training. This work performs an extensive analysis on MNIST dataset to verify the effectiveness of each VIB cells, and provides an insightful analysis on how the VIB cells affect mutual information. Experiments conducted on CIFAR-10 also prove that our cellVIB is robust against noisy labels during training and against corrupted images during testing. Then, we validate our method on PACS dataset, whose results show that the VIB cells can significantly improve the generalization performance of the basic model. Finally, in a more complex representation learning task, face recognition, our network structure has also achieved very competitive results.
♻ ☆ Don't Judge by the Look: Towards Motion Coherent Video Representation ICLR2024
Current training pipelines in object recognition neglect Hue Jittering when doing data augmentation as it not only brings appearance changes that are detrimental to classification, but also the implementation is inefficient in practice. In this study, we investigate the effect of hue variance in the context of video understanding and find this variance to be beneficial since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video understanding, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances. Concretely, we propose an operation SwapMix to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, enforcing the model to learn appearance invariant representations. Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the application of VA in other augmentation methods. Code is available at https://github.com/BeSpontaneous/MCA-pytorch.
comment: Accepted by ICLR2024
♻ ☆ Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning
Large-scale Text-to-Image (TTI) models have become a common approach for generating training data in various generative fields. However, visual hallucinations, which contain perceptually critical defects, remain a concern, especially in non-photorealistic styles like cartoon characters. We propose a novel visual hallucination detection system for cartoon character images generated by TTI models. Our approach leverages pose-aware in-context visual learning (PA-ICVL) with Vision-Language Models (VLMs), utilizing both RGB images and pose information. By incorporating pose guidance from a fine-tuned pose estimator, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. This research advances TTI models by mitigating visual hallucinations, expanding their potential in non-photorealistic domains.
comment: 11 pages, 12 figures, 1 table, Project page: https://gh-bumsookim.github.io/Cartoon-Hallucinations-Detection/
♻ ☆ MMA-Diffusion: MultiModal Attack on Diffusion Models CVPR 2024
In recent years, Text-to-Image (T2I) models have seen remarkable advancements, gaining widespread adoption. However, this progress has inadvertently opened avenues for potential misuse, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces MMA-Diffusion, a framework that presents a significant and realistic threat to the security of T2I models by effectively circumventing current defensive measures in both open-source models and commercial online services. Unlike previous approaches, MMA-Diffusion leverages both textual and visual modalities to bypass safeguards like prompt filters and post-hoc safety checkers, thus exposing and highlighting the vulnerabilities in existing defense mechanisms.
comment: CVPR 2024. Code is available at https://github.com/yangyijune/MMA-Diffusion
♻ ☆ Noisy-Correspondence Learning for Text-to-Image Person Re-identification
Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and achieved promising performance, they implicitly assume the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, the image-text pairs inevitably exist under-correlated or even false-correlated, a.k.a noisy correspondence (NC), due to the low quality of the images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet Alignment Loss (TAL) relaxes the conventional Triplet Ranking loss with the hardest negative samples to a log-exponential upper bound over all negative ones, thus preventing the model collapse under NC and can also focus on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE.
♻ ☆ CRS-Diff: Controllable Generative Remote Sensing Foundation Model
The emergence of diffusion models has revolutionized the field of image generation, providing new methods for creating high-quality, high-resolution images across various applications. However, the potential of these models for generating domain-specific images, particularly remote sensing (RS) images, remains largely untapped. RS images that are notable for their high resolution, extensive coverage, and rich information content, bring new challenges that general diffusion models may not adequately address. This paper proposes CRS-Diff, a pioneering diffusion modeling framework specifically tailored for generating remote sensing imagery, leveraging the inherent advantages of diffusion models while integrating advanced control mechanisms to ensure that the imagery is not only visually clear but also enriched with geographic and temporal information. The model integrates global and local control inputs, enabling precise combinations of generation conditions to refine the generation process. A comprehensive evaluation of CRS-Diff has demonstrated its superior capability to generate RS imagery both in a single condition and multiple conditions compared with previous methods in terms of image quality and diversity.
♻ ☆ Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence CVPR 24
While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset, for both pre-training validating models. Our method achieves a PCK@0.10 score of 65.4 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, outperforming the state of the art by 5.5p and 11.0p absolute gains, respectively. Our code and datasets are publicly available at: https://telling-left-from-right.github.io/.
comment: Accepted by CVPR 24, project page: https://telling-left-from-right.github.io/
♻ ☆ VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as reasoning modules that can deconstruct complex tasks into more manageable sub-tasks, particularly when applied to visual reasoning tasks for images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework. By presenting LLMs with pairs of instructions and their corresponding high-level programs, we harness their contextual learning capabilities to generate executable visual programs for video understanding. To enhance program's accuracy and robustness, we implement two important strategies. Firstly, we employ a feedback-generation approach, powered by GPT-3.5, to rectify errors in programs utilizing unsupported functions. Secondly, taking motivation from recent works on self refinement of LLM outputs, we introduce an iterative procedure for improving the quality of the in-context examples by aligning the initial outputs to the outputs that would have been generated had the LLM not been bound by the structure of the in-context examples. Our results on several video-specific tasks, including visual QA, video anticipation, pose estimation and multi-video QA illustrate the efficacy of these enhancements in improving the performance of visual programming approaches for video tasks.
♻ ☆ URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields
We propose a novel rolling shutter bundle adjustment method for neural radiance fields (NeRF), which utilizes the unordered rolling shutter (RS) images to obtain the implicit 3D representation. Existing NeRF methods suffer from low-quality images and inaccurate initial camera poses due to the RS effect in the image, whereas, the previous method that incorporates the RS into NeRF requires strict sequential data input, limiting its widespread applicability. In constant, our method recovers the physical formation of RS images by estimating camera poses and velocities, thereby removing the input constraints on sequential data. Moreover, we adopt a coarse-to-fine training strategy, in which the RS epipolar constraints of the pairwise frames in the scene graph are used to detect the camera poses that fall into local minima. The poses detected as outliers are corrected by the interpolation method with neighboring poses. The experimental results validate the effectiveness of our method over state-of-the-art works and demonstrate that the reconstruction of 3D representations is not constrained by the requirement of video sequence input.
♻ ☆ Improving White-box Robustness of Pre-processing Defenses via Joint Adversarial Training
Deep neural networks (DNNs) are vulnerable to adversarial noise. A range of adversarial defense techniques have been proposed to mitigate the interference of adversarial noise, among which the input pre-processing methods are scalable and show great potential to safeguard DNNs. However, pre-processing methods may suffer from the robustness degradation effect, in which the defense reduces rather than improving the adversarial robustness of a target model in a white-box setting. A potential cause of this negative effect is that adversarial training examples are static and independent to the pre-processing model. To solve this problem, we investigate the influence of full adversarial examples which are crafted against the full model, and find they indeed have a positive impact on the robustness of defenses. Furthermore, we find that simply changing the adversarial training examples in pre-processing methods does not completely alleviate the robustness degradation effect. This is due to the adversarial risk of the pre-processed model being neglected, which is another cause of the robustness degradation effect. Motivated by above analyses, we propose a method called Joint Adversarial Training based Pre-processing (JATP) defense. Specifically, we formulate a feature similarity based adversarial risk for the pre-processing model by using full adversarial examples found in a feature space. Unlike standard adversarial training, we only update the pre-processing model, which prompts us to introduce a pixel-wise loss to improve its cross-model transferability. We then conduct a joint adversarial training on the pre-processing model to minimize this overall risk. Empirical results show that our method could effectively mitigate the robustness degradation effect across different target models in comparison to previous state-of-the-art approaches.
♻ ☆ Masked Vector Quantization
Generative models with discrete latent representations have recently demonstrated an impressive ability to learn complex high-dimensional data distributions. However, their performance relies on a long sequence of tokens per instance and a large number of codebook entries, resulting in long sampling times and considerable computation to fit the categorical posterior. To address these issues, we propose the Masked Vector Quantization (MVQ) framework which increases the representational capacity of each code vector by learning mask configurations via a stochastic winner-takes-all training regime called Multiple Hypothese Dropout (MH-Dropout). On ImageNet 64$\times$64, MVQ reduces FID in existing vector quantization architectures by up to $68\%$ at 2 tokens per instance and $57\%$ at 5 tokens. These improvements widen as codebook entries is reduced and allows for $7\textit{--}45\times$ speed-up in token sampling during inference. As an additional benefit, we find that smaller latent spaces lead to MVQ identifying transferable visual representations where multiple can be smoothly combined.
comment: A newer version of this manuscript was archived under 2312.11735
Information Retrieval
☆ ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search LREC
Retrieval-based code question answering seeks to match user queries in natural language to relevant code snippets. Previous approaches typically rely on pretraining models using crafted bi-modal and uni-modal datasets to align text and code representations. In this paper, we introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community, offering naturally structured mixed-modal QA pairs. To validate its effectiveness, we propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models. Compared to previous models that primarily employ bimodal and unimodal pairs extracted from CodeSearchNet for pre-training, our model exhibits significant performance improvements across a wide range of code retrieval benchmarks.
comment: Accepted to LREC-COLING 2024
☆ A comparative analysis of embedding models for patent similarity
This paper makes two contributions to the field of text-based patent similarity. First, it compares the performance of different kinds of patent-specific pretrained embedding models, namely static word embeddings (such as word2vec and doc2vec models) and contextual word embeddings (such as transformers based models), on the task of patent similarity calculation. Second, it compares specifically the performance of Sentence Transformers (SBERT) architectures with different training phases on the patent similarity task. To assess the models' performance, we use information about patent interferences, a phenomenon in which two or more patent claims belonging to different patent applications are proven to be overlapping by patent examiners. Therefore, we use these interferences cases as a proxy for maximum similarity between two patents, treating them as ground-truth to evaluate the performance of the different embedding models. Our results point out that, first, Patent SBERT-adapt-ub, the domain adaptation of the pretrained Sentence Transformer architecture proposed in this research, outperforms the current state-of-the-art in patent similarity. Second, they show that, in some cases, large static models performances are still comparable to contextual ones when trained on extensive data; thus, we believe that the superiority in the performance of contextual embeddings may not be related to the actual architecture but rather to the way the training phase is performed.
☆ LARA: Linguistic-Adaptive Retrieval-Augmented LLMs for Multi-Turn Intent Classification
Following the significant achievements of large language models (LLMs), researchers have employed in-context learning for text classification tasks. However, these studies focused on monolingual, single-turn classification tasks. In this paper, we introduce LARA (Linguistic-Adaptive Retrieval-Augmented Language Models), designed to enhance accuracy in multi-turn classification tasks across six languages, accommodating numerous intents in chatbot interactions. Multi-turn intent classification is notably challenging due to the complexity and evolving nature of conversational contexts. LARA tackles these issues by combining a fine-tuned smaller model with a retrieval-augmented mechanism, integrated within the architecture of LLMs. This integration allows LARA to dynamically utilize past dialogues and relevant intents, thereby improving the understanding of the context. Furthermore, our adaptive retrieval techniques bolster the cross-lingual capabilities of LLMs without extensive retraining and fine-tune. Comprehensive experiments demonstrate that LARA achieves state-of-the-art performance on multi-turn intent classification tasks, enhancing the average accuracy by 3.67% compared to existing methods.
☆ InstUPR : Instruction-based Unsupervised Passage Reranking with Large Language Models
This paper introduces InstUPR, an unsupervised passage reranking method based on large language models (LLMs). Different from existing approaches that rely on extensive training with query-document pairs or retrieval-specific instructions, our method leverages the instruction-following capabilities of instruction-tuned LLMs for passage reranking without any additional fine-tuning. To achieve this, we introduce a soft score aggregation technique and employ pairwise reranking for unsupervised passage reranking. Experiments on the BEIR benchmark demonstrate that InstUPR outperforms unsupervised baselines as well as an instruction-tuned reranker, highlighting its effectiveness and superiority. Source code to reproduce all experiments is open-sourced at https://github.com/MiuLab/InstUPR
comment: Preprint. This manuscript was originally written and submitted in June 2023
☆ An Experiment with the Use of ChatGPT for LCSH Subject Assignment on Electronic Theses and Dissertations
This study delves into the potential use of Large Language Models (LLMs) for generating Library of Congress Subject Headings (LCSH). The authors employed ChatGPT to generate subject headings for electronic theses and dissertations (ETDs) based on their titles and summaries. The results revealed that although some generated subject headings were valid, there were issues regarding specificity and exhaustiveness. The study showcases that LLMs can serve as a strategic response to the backlog of items awaiting cataloging in academic libraries, while also offering a cost-effective approach for promptly generating LCSH. Nonetheless, human catalogers remain essential for verifying and enhancing the validity, exhaustiveness, and specificity of LCSH generated by LLMs.
comment: 20 pages
☆ Play to Your Strengths: Collaborative Intelligence of Conventional Recommender Models and Large Language Models
The rise of large language models (LLMs) has opened new opportunities in Recommender Systems (RSs) by enhancing user behavior modeling and content understanding. However, current approaches that integrate LLMs into RSs solely utilize either LLM or conventional recommender model (CRM) to generate final recommendations, without considering which data segments LLM or CRM excel in. To fill in this gap, we conduct experiments on MovieLens-1M and Amazon-Books datasets, and compare the performance of a representative CRM (DCNv2) and an LLM (LLaMA2-7B) on various groups of data samples. Our findings reveal that LLMs excel in data segments where CRMs exhibit lower confidence and precision, while samples where CRM excels are relatively challenging for LLM, requiring substantial training data and a long training time for comparable performance. This suggests potential synergies in the combination between LLM and CRM. Motivated by these insights, we propose Collaborative Recommendation with conventional Recommender and Large Language Model (dubbed \textit{CoReLLa}). In this framework, we first jointly train LLM and CRM and address the issue of decision boundary shifts through alignment loss. Then, the resource-efficient CRM, with a shorter inference time, handles simple and moderate samples, while LLM processes the small subset of challenging samples for CRM. Our experimental results demonstrate that CoReLLa outperforms state-of-the-art CRM and LLM methods significantly, underscoring its effectiveness in recommendation tasks.
☆ Uncovering Selective State Space Model's Capabilities in Lifelong Sequential Recommendation
Sequential Recommenders have been widely applied in various online services, aiming to model users' dynamic interests from their sequential interactions. With users increasingly engaging with online platforms, vast amounts of lifelong user behavioral sequences have been generated. However, existing sequential recommender models often struggle to handle such lifelong sequences. The primary challenges stem from computational complexity and the ability to capture long-range dependencies within the sequence. Recently, a state space model featuring a selective mechanism (i.e., Mamba) has emerged. In this work, we investigate the performance of Mamba for lifelong sequential recommendation (i.e., length>=2k). More specifically, we leverage the Mamba block to model lifelong user sequences selectively. We conduct extensive experiments to evaluate the performance of representative sequential recommendation models in the setting of lifelong sequences. Experiments on two real-world datasets demonstrate the superiority of Mamba. We found that RecMamba achieves performance comparable to the representative model while significantly reducing training duration by approximately 70% and memory costs by 80%. Codes and data are available at \url{https://github.com/nancheng58/RecMamba}.
☆ Enhanced Facet Generation with LLM Editing LREC
In information retrieval, facet identification of a user query is an important task. If a search service can recognize the facets of a user's query, it has the potential to offer users a much broader range of search results. Previous studies can enhance facet prediction by leveraging retrieved documents and related queries obtained through a search engine. However, there are challenges in extending it to other applications when a search engine operates as part of the model. First, search engines are constantly updated. Therefore, additional information may change during training and test, which may reduce performance. The second challenge is that public search engines cannot search for internal documents. Therefore, a separate search system needs to be built to incorporate documents from private domains within the company. We propose two strategies that focus on a framework that can predict facets by taking only queries as input without a search engine. The first strategy is multi-task learning to predict SERP. By leveraging SERP as a target instead of a source, the proposed model deeply understands queries without relying on external modules. The second strategy is to enhance the facets by combining Large Language Model (LLM) and the small model. Overall performance improves when small model and LLM are combined rather than facet generation individually.
comment: Accepted at LREC-COLING 2024
☆ EXPLORA: A teacher-apprentice methodology for eliciting natural child-computer interactions
Investigating child-computer interactions within their contexts is vital for designing technology that caters to children's needs. However, determining what aspects of context are relevant for designing child-centric technology remains a challenge. We introduce EXPLORA, a multimodal, multistage online methodology comprising three pivotal stages: (1) building a teacher-apprentice relationship,(2) learning from child-teachers, and (3) assessing and reinforcing researcher-apprentice learning. Central to EXPLORA is the collection of attitudinal data through pre-observation interviews, offering researchers a deeper understanding of children's characteristics and contexts. This informs subsequent online observations, allowing researchers to focus on frequent interactions. Furthermore, researchers can validate preliminary assumptions with children. A means-ends analysis framework aids in the systematic analysis of data, shedding light on context, agency and homework-information searching processes children employ in their activities. To illustrate EXPLORA's capabilities, we present nine single case studies investigating Brazilian child-caregiver dyads' (children ages 9-11) use of technology in homework information-searching.
☆ CADGL: Context-Aware Deep Graph Learning for Predicting Drug-Drug Interactions
Examining Drug-Drug Interactions (DDIs) is a pivotal element in the process of drug development. DDIs occur when one drug's properties are affected by the inclusion of other drugs. Detecting favorable DDIs has the potential to pave the way for creating and advancing innovative medications applicable in practical settings. However, existing DDI prediction models continue to face challenges related to generalization in extreme cases, robust feature extraction, and real-life application possibilities. We aim to address these challenges by leveraging the effectiveness of context-aware deep graph learning by introducing a novel framework named CADGL. Based on a customized variational graph autoencoder (VGAE), we capture critical structural and physio-chemical information using two context preprocessors for feature extraction from two different perspectives: local neighborhood and molecular context, in a heterogeneous graphical structure. Our customized VGAE consists of a graph encoder, a latent information encoder, and an MLP decoder. CADGL surpasses other state-of-the-art DDI prediction models, excelling in predicting clinically valuable novel DDIs, supported by rigorous case studies.
comment: 8 Pages, 4 Figures; In review in IEEE/ACM Transactions on Computational Biology and Bioinformatics
☆ Generation of Asset Administration Shell with Large Language Model Agents: Interoperability in Digital Twins with Semantic Node
This research introduces a novel approach for assisting the creation of Asset Administration Shell (AAS) instances for digital twin modeling within the context of Industry 4.0, aiming to enhance interoperability in smart manufacturing and reduce manual effort. We construct a "semantic node" data structure to capture the semantic essence of textual data. Then, a system powered by large language models is designed and implemented to process "semantic node" and generate AAS instance models from textual technical data. Our evaluation demonstrates a 62-79% effective generation rate, indicating a substantial proportion of manual creation effort can be converted into easier validation effort, thereby reducing the time and cost in creating AAS instance models. In our evaluation, a comparative analysis of different LLMs and an in-depth ablation study of Retrieval-Augmented Generation (RAG) mechanisms provide insights into the effectiveness of LLM systems for interpreting technical concepts. Our findings emphasize LLMs' capability in automating AAS instance creation, enhancing semantic interoperability, and contributing to the broader field of semantic interoperability for digital twins in industrial applications. The prototype implementation and evaluation results are released on our GitHub Repository with the link: https://github.com/YuchenXia/AASbyLLM
comment: Pre-print, submitted to IEEE ACCESS, under peer-review
☆ GOLF: Goal-Oriented Long-term liFe tasks supported by human-AI collaboration
The advent of ChatGPT and similar large language models (LLMs) has revolutionized the human-AI interaction and information-seeking process. Leveraging LLMs as an alternative to search engines, users can now access summarized information tailored to their queries, significantly reducing the cognitive load associated with navigating vast information resources. This shift underscores the potential of LLMs in redefining information access paradigms. Drawing on the foundation of task-focused information retrieval and LLMs' task planning ability, this research extends the scope of LLM capabilities beyond routine task automation to support users in navigating long-term and significant life tasks. It introduces the GOLF framework (Goal-Oriented Long-term liFe tasks), which focuses on enhancing LLMs' ability to assist in significant life decisions through goal orientation and long-term planning. The methodology encompasses a comprehensive simulation study to test the framework's efficacy, followed by model and human evaluations to develop a dataset benchmark for long-term life tasks, and experiments across different models and settings. By shifting the focus from short-term tasks to the broader spectrum of long-term life goals, this research underscores the transformative potential of LLMs in enhancing human decision-making processes and task management, marking a significant step forward in the evolution of human-AI collaboration.
☆ Reinforcement Learning-based Recommender Systems with Large Language Models for State Reward and Action Modeling
Reinforcement Learning (RL)-based recommender systems have demonstrated promising performance in meeting user expectations by learning to make accurate next-item recommendations from historical user-item interactions. However, existing offline RL-based sequential recommendation methods face the challenge of obtaining effective user feedback from the environment. Effectively modeling the user state and shaping an appropriate reward for recommendation remains a challenge. In this paper, we leverage language understanding capabilities and adapt large language models (LLMs) as an environment (LE) to enhance RL-based recommenders. The LE is learned from a subset of user-item interaction data, thus reducing the need for large training data, and can synthesise user feedback for offline data by: (i) acting as a state model that produces high quality states that enrich the user representation, and (ii) functioning as a reward model to accurately capture nuanced user preferences on actions. Moreover, the LE allows to generate positive actions that augment the limited offline training data. We propose a LE Augmentation (LEA) method to further improve recommendation performance by optimising jointly the supervised component and the RL policy, using the augmented actions and historical user signals. We use LEA, the state and reward models in conjunction with state-of-the-art RL recommenders and report experimental results on two publicly available datasets.
☆ GloSIS: The Global Soil Information System Web Ontology
Established in 2012 by members of the Food and Agriculture Organisation (FAO), the Global Soil Partnership (GSP) is a global network of stakeholders promoting sound land and soil management practices towards a sustainable world food system. However, soil survey largely remains a local or regional activity, bound to heterogeneous methods and conventions. Recognising the relevance of global and trans-national policies towards sustainable land management practices, the GSP elected data harmonisation and exchange as one of its key lines of action. Building upon international standards and previous work towards a global soil data ontology, an improved domain model was eventually developed within the GSP [54], the basis for a Global Soil Information System (GloSIS). This work also identified the Semantic Web as a possible avenue to operationalise the domain model. This article presents the GloSIS web ontology, an implementation of the GloSIS domain model with the Web Ontology Language (OWL). Thoroughly employing a host of Semantic Web standards (SOSA, SKOS, GeoSPARQL, QUDT), GloSIS lays out not only a soil data ontology but also an extensive set of ready-to-use code-lists for soil description and physio-chemical analysis. Various examples are provided on the provision and use of GloSIS-compliant linked data, showcasing the contribution of this ontology to the discovery, exploration, integration and access of soil data.
☆ Graph Augmentation for Recommendation ICDE 2024
Graph augmentation with contrastive learning has gained significant attention in the field of recommendation systems due to its ability to learn expressive user representations, even when labeled data is limited. However, directly applying existing GCL models to real-world recommendation environments poses challenges. There are two primary issues to address. Firstly, the lack of consideration for data noise in contrastive learning can result in noisy self-supervised signals, leading to degraded performance. Secondly, many existing GCL approaches rely on graph neural network (GNN) architectures, which can suffer from over-smoothing problems due to non-adaptive message passing. To address these challenges, we propose a principled framework called GraphAug. This framework introduces a robust data augmentor that generates denoised self-supervised signals, enhancing recommender systems. The GraphAug framework incorporates a graph information bottleneck (GIB)-regularized augmentation paradigm, which automatically distills informative self-supervision information and adaptively adjusts contrastive view generation. Through rigorous experimentation on real-world datasets, we thoroughly assessed the performance of our novel GraphAug model. The outcomes consistently unveil its superiority over existing baseline methods. The source code for our model is publicly available at: https://github.com/HKUDS/GraphAug.
comment: 13 pages and accepted by ICDE 2024
♻ ☆ NoteLLM: A Retrievable Large Language Model for Note Recommendation WWW'24
People enjoy sharing "notes" including their experiences within online communities. Therefore, recommending notes aligned with user interests has become a crucial task. Existing online methods only input notes into BERT-based models to generate note embeddings for assessing similarity. However, they may underutilize some important cues, e.g., hashtags or categories, which represent the key concepts of notes. Indeed, learning to generate hashtags/categories can potentially enhance note embeddings, both of which compress key note information into limited content. Besides, Large Language Models (LLMs) have significantly outperformed BERT in understanding natural languages. It is promising to introduce LLMs into note recommendation. In this paper, we propose a novel unified framework called NoteLLM, which leverages LLMs to address the item-to-item (I2I) note recommendation. Specifically, we utilize Note Compression Prompt to compress a note into a single special token, and further learn the potentially related notes' embeddings via a contrastive learning approach. Moreover, we use NoteLLM to summarize the note and generate the hashtag/category automatically through instruction tuning. Extensive validations on real scenarios demonstrate the effectiveness of our proposed method compared with the online baseline and show major improvements in the recommendation system of Xiaohongshu.
comment: Published as a WWW'24 full paper
♻ ☆ Word4Per: Zero-shot Composed Person Retrieval
Searching for specific person has great social benefits and security value, and it often involves a combination of visual and textual information. Conventional person retrieval methods, whether image-based or text-based, usually fall short in effectively harnessing both types of information, leading to the loss of accuracy. In this paper, a whole new task called Composed Person Retrieval (CPR) is proposed to jointly utilize both image and text information for target person retrieval. However, the supervised CPR requires very costly manual annotation dataset, while there are currently no available resources. To mitigate this issue, we firstly introduce the Zero-shot Composed Person Retrieval (ZS-CPR), which leverages existing domain-related data to resolve the CPR problem without expensive annotations. Secondly, to learn ZS-CPR model, we propose a two-stage learning framework, Word4Per, where a lightweight Textual Inversion Network (TINet) and a text-based person retrieval model based on fine-tuned Contrastive Language-Image Pre-training (CLIP) network are learned without utilizing any CPR data. Thirdly, a finely annotated Image-Text Composed Person Retrieval (ITCPR) dataset is built as the benchmark to assess the performance of the proposed Word4Per framework. Extensive experiments under both Rank-1 and mAP demonstrate the effectiveness of Word4Per for the ZS-CPR task, surpassing the comparative methods by over 10\%. The code and ITCPR dataset will be publicly available at https://github.com/Delong-liu-bupt/Word4Per.
♻ ☆ GNNUERS: Fairness Explanation in GNNs for Recommendation via Counterfactual Reasoning
Nowadays, research into personalization has been focusing on explainability and fairness. Several approaches proposed in recent works are able to explain individual recommendations in a post-hoc manner or by explanation paths. However, explainability techniques applied to unfairness in recommendation have been limited to finding user/item features mostly related to biased recommendations. In this paper, we devised a novel algorithm that leverages counterfactuality methods to discover user unfairness explanations in the form of user-item interactions. In our counterfactual framework, interactions are represented as edges in a bipartite graph, with users and items as nodes. Our bipartite graph explainer perturbs the topological structure to find an altered version that minimizes the disparity in utility between the protected and unprotected demographic groups. Experiments on four real-world graphs coming from various domains showed that our method can systematically explain user unfairness on three state-of-the-art GNN-based recommendation models. Moreover, an empirical evaluation of the perturbed network uncovered relevant patterns that justify the nature of the unfairness discovered by the generated explanations. The source code and the preprocessed data sets are available at https://github.com/jackmedda/RS-BGExplainer.
♻ ☆ On the resilience of Collaborative Learning-based Recommender Systems Against Community Detection Attack
Collaborative-learning-based recommender systems emerged following the success of collaborative learning techniques such as Federated Learning (FL) and Gossip Learning (GL). In these systems, users participate in the training of a recommender system while maintaining their history of consumed items on their devices. While these solutions seemed appealing for preserving the privacy of the participants at first glance, recent studies have revealed that collaborative learning can be vulnerable to various privacy attacks. In this paper, we study the resilience of collaborative learning-based recommender systems against a novel privacy attack called Community Detection Attack (CDA). This attack enables an adversary to identify community members based on a chosen set of items (eg., identifying users interested in specific points-of-interest). Through experiments on three real recommendation datasets using two state-of-the-art recommendation models, we evaluate the sensitivity of an FL-based recommender system as well as two flavors of Gossip Learning-based recommender systems to CDA. The results show that across all models and datasets, the FL setting is more vulnerable to CDA compared to Gossip settings. Furthermore, we assess two off-the-shelf mitigation strategies, namely differential privacy (DP) and a \emph{Share less} policy, which consists of sharing a subset of less sensitive model parameters. The findings indicate a more favorable privacy-utility trade-off for the \emph{Share less} strategy, particularly in FedRecs.
♻ ☆ Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation NAACL 2024
Synthetic users are cost-effective proxies for real users in the evaluation of conversational recommender systems. Large language models show promise in simulating human-like behavior, raising the question of their ability to represent a diverse population of users. We introduce a new protocol to measure the degree to which language models can accurately emulate human behavior in conversational recommendation. This protocol is comprised of five tasks, each designed to evaluate a key property that a synthetic user should exhibit: choosing which items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Through evaluation of baseline simulators, we demonstrate these tasks effectively reveal deviations of language models from human behavior, and offer insights on how to reduce the deviations with model selection and prompting strategies.
comment: NAACL 2024
Mathematical Software
☆ Symbolic and User-friendly Geometric Algebra Routines (SUGAR) for Computations in Matlab
Geometric algebra (GA) is a mathematical tool for geometric computing, providing a framework that allows a unified and compact approach to geometric relations which in other mathematical systems are typically described using different more complicated elements. This fact has led to an increasing adoption of GA in applied mathematics and engineering problems. However, the scarcity of symbolic implementations of GA and its inherent complexity, requiring a specific mathematical background, make it challenging and less intuitive for engineers to work with. This prevents wider adoption among more applied professionals. To address this challenge, this paper introduces SUGAR (Symbolic and User-friendly Geometric Algebra Routines), an open-source toolbox designed for Matlab and licensed under the MIT License. SUGAR facilitates the translation of GA concepts into Matlab and provides a collection of user-friendly functions tailored for GA computations, including support for symbolic operations. It supports both numeric and symbolic computations in high-dimensional GAs. Specifically tailored for applied mathematics and engineering applications, SUGAR has been meticulously engineered to represent geometric elements and transformations within two and three-dimensional projective and conformal geometric algebras, aligning with established computational methodologies in the literature. Furthermore, SUGAR efficiently handles functions of multivectors, such as exponential, logarithmic, sinusoidal, and cosine functions, enhancing its applicability across various engineering domains, including robotics, control systems, and power electronics. Finally, this work includes four distinct validation examples, demonstrating SUGAR's capabilities across the above-mentioned fields and its practical utility in addressing real-world applied mathematics and engineering problems.
comment: 33 pages, 6 figures, journal paper submitted to ACM TOMS
Data Structures and Algorithms
☆ Adaptive Frequency Bin Interval in FFT via Dense Sampling Factor $α$
The Fast Fourier Transform (FFT) is a fundamental tool for signal analysis, widely used across various fields. However, traditional FFT methods encounter challenges in adjusting the frequency bin interval, which may impede accurate spectral analysis. In this study, we propose a method for adjusting the frequency bin interval in FFT by introducing a parameter $\alpha$. We elucidate the underlying principles of the proposed method and discuss its potential applications across various contexts. Our findings suggest that the proposed method offers a promising approach to overcome the limitations of traditional FFT methods and enhance spectral analysis accuracy.
☆ Algorithms and data structures for numerical computations with automatic precision estimation
We introduce data structures and algorithms to count numerical inaccuracies arising from usage of floating numbers described in IEEE 754. Here we describe how to estimate precision for some collection of functions most commonly used for array manipulations and training of neural networks. For highly optimized functions like matrix multiplication, we provide a fast estimation of precision and some hint how the estimation can be strengthened.
☆ Minimum-cost paths for electric cars
An electric car equipped with a battery of a finite capacity travels on a road network with an infrastructure of charging stations. Each charging station has a possibly different cost per unit of energy. Traversing a given road segment requires a specified amount of energy that may be positive, zero or negative. The car can only traverse a road segment if it has enough charge to do so (the charge cannot drop below zero), and it cannot charge its battery beyond its capacity. To travel from one point to another the car needs to choose a \emph{travel plan} consisting of a path in the network and a recharging schedule that specifies how much energy to charge at each charging station on the path, making sure of having enough energy to reach the next charging station or the destination. The cost of the plan is the total charging cost along the chosen path. We reduce the problem of computing plans between every two junctions of the network to two problems: Finding optimal energetic paths when no charging is allowed and finding standard shortest paths. When there are no negative cycles in the network, we obtain an $O(n^3)$-time algorithm for computing all-pairs travel plans, where~$n$ is the number of junctions in the network. We obtain slightly faster algorithms under some further assumptions. We also consider the case in which a bound is placed on the number of rechargings allowed.
comment: 9 pages, 2 figures, SOSA 2024
☆ High-Temperature Gibbs States are Unentangled and Efficiently Preparable
We show that thermal states of local Hamiltonians are separable above a constant temperature. Specifically, for a local Hamiltonian $H$ on a graph with degree $\mathfrak{d}$, its Gibbs state at inverse temperature $\beta$, denoted by $\rho =e^{-\beta H}/ \textrm{tr}(e^{-\beta H})$, is a classical distribution over product states for all $\beta < 1/(c\mathfrak{d})$, where $c$ is a constant. This sudden death of thermal entanglement upends conventional wisdom about the presence of short-range quantum correlations in Gibbs states. Moreover, we show that we can efficiently sample from the distribution over product states. In particular, for any $\beta < 1/( c \mathfrak{d}^3)$, we can prepare a state $\epsilon$-close to $\rho$ in trace distance with a depth-one quantum circuit and $\textrm{poly}(n) \log(1/\epsilon)$ classical overhead. A priori the task of preparing a Gibbs state is a natural candidate for achieving super-polynomial quantum speedups, but our results rule out this possibility above a fixed constant temperature.
♻ ☆ Multi-Agent Online Graph Exploration on Cycles and Tadpole Graphs
We study the problem of multi-agent online graph exploration, in which a team of k agents has to explore a given graph, starting and ending on the same node. The graph is initially unknown. Whenever a node is visited by an agent, its neighborhood and adjacent edges are revealed. The agents share a global view of the explored parts of the graph. The cost of the exploration has to be minimized, where cost either describes the time needed for the entire exploration (time model), or the length of the longest path traversed by any agent (energy model). We investigate graph exploration on cycles and tadpole graphs for 2-4 agents, providing optimal results on the competitive ratio in the energy model (1-competitive with two agents on cycles and three agents on tadpole graphs), and for tadpole graphs in the time model (1.5-competitive with four agents). We also show competitive upper bounds of 2 for the exploration of tadpole graphs with three agents, and 2.5 for the exploration of tadpole graphs with two agents in the time model.
comment: v2: Update Related Work, more detailed description of models in Motivation
♻ ☆ Spectral clustering in the Gaussian mixture block model
Gaussian mixture block models are distributions over graphs that strive to model modern networks: to generate a graph from such a model, we associate each vertex $i$ with a latent feature vector $u_i \in \mathbb{R}^d$ sampled from a mixture of Gaussians, and we add edge $(i,j)$ if and only if the feature vectors are sufficiently similar, in that $\langle u_i,u_j \rangle \ge \tau$ for a pre-specified threshold $\tau$. The different components of the Gaussian mixture represent the fact that there may be different types of nodes with different distributions over features -- for example, in a social network each component represents the different attributes of a distinct community. Natural algorithmic tasks associated with these networks are embedding (recovering the latent feature vectors) and clustering (grouping nodes by their mixture component). In this paper we initiate the study of clustering and embedding graphs sampled from high-dimensional Gaussian mixture block models, where the dimension of the latent feature vectors $d\to \infty$ as the size of the network $n \to \infty$. This high-dimensional setting is most appropriate in the context of modern networks, in which we think of the latent feature space as being high-dimensional. We analyze the performance of canonical spectral clustering and embedding algorithms for such graphs in the case of 2-component spherical Gaussian mixtures, and begin to sketch out the information-computation landscape for clustering and embedding in these models.
comment: 48 pages
♻ ☆ Implementing any Linear Combination of Unitaries on Intermediate-term Quantum Computers
We develop three new methods to implement any Linear Combination of Unitaries (LCU), a powerful quantum algorithmic tool with diverse applications. While the standard LCU procedure requires several ancilla qubits and sophisticated multi-qubit controlled operations, our methods consume significantly fewer quantum resources. The first method (Single-Ancilla LCU) estimates expectation values of observables with respect to any quantum state prepared by an LCU procedure while requiring only a single ancilla qubit, and no multi-qubit controlled operations. The second approach (Analog LCU) is a simple, physically motivated, continuous-time analogue of LCU, tailored to hybrid qubit-qumode systems. The third method (Ancilla-free LCU) requires no ancilla qubit at all and is useful when we are interested in the projection of a quantum state (prepared by the LCU procedure) in some subspace of interest. We apply the first two techniques to develop new quantum algorithms for a wide range of practical problems, ranging from Hamiltonian simulation, ground state preparation and property estimation, and quantum linear systems. Remarkably, despite consuming fewer quantum resources they retain a provable quantum advantage. The third technique allows us to connect discrete and continuous-time quantum walks with their classical counterparts. It also unifies the recently developed optimal quantum spatial search algorithms in both these frameworks, and leads to the development of new ones that require fewer ancilla qubits. Overall, our results are quite generic and can be readily applied to other problems, even beyond those considered here.
comment: 68+24 pages, 3 Figures. Added detailed comparisons with prior works
♻ ☆ The Discrepancy of Shortest Paths
The hereditary discrepancy of a set system is a certain quantitative measure of the pseudorandom properties of the system. Roughly, hereditary discrepancy measures how well one can $2$-color the elements of the system so that each set contains approximately the same number of elements of each color. Hereditary discrepancy has well-studied applications e.g. in communication complexity and derandomization. More recently, the hereditary discrepancy of set systems of shortest paths has found applications in differential privacy [Chen et al.~SODA 23]. The contribution of this paper is to improve the upper and lower bounds on the hereditary discrepancy of set systems of unique shortest paths in graphs. In particular, we show that any system of unique shortest paths in an undirected weighted graph has hereditary discrepancy $\widetilde{O}(n^{1/4})$, and we construct lower bound examples demonstrating that this bound is tight up to hidden $\text{polylog } n$ factors. Our lower bounds apply even in the planar and bipartite settings, and they improve on a previous lower bound of $\Omega(n^{1/6})$ obtained by applying the trace bound of Chazelle and Lvov [SoCG'00] to a classical point-line system of Erd\H{o}s. As applications, we improve the lower bound on the additive error for differentially-private all pairs shortest distances from $\Omega(n^{1/6})$ [Chen et al.~SODA 23] to $\Omega(n^{1/4})$, and we improve the lower bound on additive error for the differentially-private all sets range queries problem to $\Omega(n^{1/4})$, which is tight up to hidden $\text{polylog } n$ factors [Deng et al.~WADS 23].
♻ ☆ Vehicle Routing with Time-Dependent Travel Times: Theory, Practice, and Benchmarks
We develop theoretical foundations and practical algorithms for vehicle routing with time-dependent travel times. We also provide new benchmark instances and experimental results. First, we study basic operations on piecewise linear arrival time functions. In particular, we devise a faster algorithm to compute the pointwise minimum of a set of piecewise linear functions and a monotonicity-preserving variant of the Imai-Iri algorithm to approximate an arrival time function with fewer breakpoints. Next, we show how to evaluate insertion and deletion operations in tours efficiently and update the underlying data structure faster than previously known when a tour changes. Evaluating a tour also requires a scheduling step which is non-trivial in the presence of time windows and time-dependent travel times. We show how to perform this in linear time. Based on these results, we develop a local search heuristic to solve real-world vehicle routing problems with various constraints efficiently and report experimental results on classical benchmarks. Since most of these do not have time-dependent travel times, we generate and publish new benchmark instances that are based on real-world data. This data also demonstrates the importance of considering time-dependent travel times in instances with tight time windows.
Signal Processing
☆ Resonant Beam Communications: A New Design Paradigm and Challenges
Resonant beam communications (RBCom), which adopt oscillating photons between two separate retroreflectors for information transmission, exhibit potential advantages over other types of wireless optical communications (WOC). However, echo interference generated by the modulated beam reflected from the receiver affects the transmission of the desired information. To tackle this challenge, a synchronization-based point-to-point RBCom system is proposed to eliminate the echo interference, and the design for the transmitter and receiver is discussed. Subsequently, the performance of the proposed RBCom is evaluated and compared with that of visible light communications (VLC) and free space optical communications (FOC). Finally, future research directions are outlined and several implementation challenges of RBCom systems are highlighted.
☆ Design and Performance of Resonant Beam Communications -- Part II: Mobile Scenario
This two-part paper focuses on the system design and performance analysis for a point-to-point resonant beam communication (RBCom) system under both the quasi-static and mobile scenarios. Part I of this paper proposes a synchronization-based information transmission scheme and derives the capacity upper and lower bounds for the quasi-static channel case. In Part II, we address the mobile scenario, where the receiver is in relative motion to the transmitter, and derive a mobile RBCom channel model that jointly considers the Doppler effect, channel variation, and echo interference. With the obtained channel model, we prove that the channel gain of the mobile RBCom decreases as the number of transmitted frames increases, and thus show that the considered mobile RBCom terminates after the transmitter sends a certain number of frames without frequency compensation. By deriving an upper bound on the number of successfully transmitted frames, we formulate the throughput maximization problem for the considered mobile RBCom system, and solve it via a sequential parametric convex approximation (SPCA) method. Finally, simulation results validate the analysis of our proposed method in some typical scenarios.
☆ Design and Performance of Resonant Beam Communications -- Part I: Quasi-Static Scenario
This two-part paper studies a point-to-point resonant beam communication (RBCom) system, where two separately deployed retroreflectors are adopted to generate the resonant beam between the transmitter and the receiver, and analyzes the transmission rate of the considered system under both the quasi-static and mobile scenarios. Part I of this paper focuses on the quasi-static scenario where the locations of the transmitter and the receiver are relatively fixed. Specifically, we propose a new information-bearing scheme which adopts a synchronization-based amplitude modulation method to mitigate the echo interference caused by the reflected resonant beam. With this scheme, we show that the quasi-static RBCom channel is equivalent to a Markov channel and can be further simplified as an amplitude-constrained additive white Gaussian noise channel. Moreover, we develop an algorithm that jointly employs the bisection and exhaustive search to maximize its capacity upper and lower bounds. Finally, numerical results validate our analysis. Part II of this paper discusses the performance of the RBCom system under the mobile scenario.
☆ Adaptive Frequency Bin Interval in FFT via Dense Sampling Factor $α$
The Fast Fourier Transform (FFT) is a fundamental tool for signal analysis, widely used across various fields. However, traditional FFT methods encounter challenges in adjusting the frequency bin interval, which may impede accurate spectral analysis. In this study, we propose a method for adjusting the frequency bin interval in FFT by introducing a parameter $\alpha$. We elucidate the underlying principles of the proposed method and discuss its potential applications across various contexts. Our findings suggest that the proposed method offers a promising approach to overcome the limitations of traditional FFT methods and enhance spectral analysis accuracy.
☆ Near-field Beam Steering with Planar Antenna Array
Beam steering enables manipulation of the electromagnetic radiation patterns in antenna array systems. A methodology for steering beams in the near field of a planar antenna array with known phase wavefront functions towards arbitrary azimuth and elevation angles is described in this paper. Rotation of the phase wavefront function is used while preserving the shape. The phase shifts for antenna element excitation currents are determined based on the distances from antenna elements to the nearest point on the rotated wavefront. Beam steering utilizing a Gaussian and a Bessel beam is studied. Phase distribution examples for various steering directions are considered. For Bessel beam steering, a discussion on the resulting beam shapes and the steering impact on the polarization mismatch is provided. The results show a non-negligible magnitude of the cross-polarization with values depending on the steering direction.
comment: This paper was submitted to Seventh International Balkan Conference on Communications and Networkingm (BalkanCom24), Ljubljana, Slovenia, 2024
☆ Exploit High-Dimensional RIS Information to Localization: What Is the Impact of Faulty Element?
This paper proposes a novel localization algorithm using the reconfigurable intelligent surface (RIS) received signal, i.e., RIS information. Compared with BS received signal, i.e., BS information, RIS information offers higher dimension and richer feature set, thereby providing an enhanced capacity to distinguish positions of the mobile users (MUs). Additionally, we address a practical scenario where RIS contains some unknown (number and places) faulty elements that cannot receive signals. Initially, we employ transfer learning to design a two-phase transfer learning (TPTL) algorithm, designed for accurate detection of faulty elements. Then our objective is to regain the information lost from the faulty elements and reconstruct the complete high-dimensional RIS information for localization. To this end, we propose a transfer-enhanced dual-stage (TEDS) algorithm. In \emph{Stage I}, we integrate the CNN and variational autoencoder (VAE) to obtain the RIS information, which in \emph{Stage II}, is input to the transferred DenseNet 121 to estimate the location of the MU. To gain more insight, we propose an alternative algorithm named transfer-enhanced direct fingerprint (TEDF) algorithm which only requires the BS information. The comparison between TEDS and TEDF reveals the effectiveness of faulty element detection and the benefits of utilizing the high-dimensional RIS information for localization. Besides, our empirical results demonstrate that the performance of the localization algorithm is dominated by the high-dimensional RIS information and is robust to unoptimized phase shifts and signal-to-noise ratio (SNR).
☆ Employing High-Dimensional RIS Information for RIS-aided Localization Systems
Reconfigurable intelligent surface (RIS)-aided localization systems have attracted extensive research attention due to their accuracy enhancement capabilities. However, most studies primarily utilized the base stations (BS) received signal, i.e., BS information, for localization algorithm design, neglecting the potential of RIS received signal, i.e., RIS information. Compared with BS information, RIS information offers higher dimension and richer feature set, thereby significantly improving the ability to extract positions of the mobile users (MUs). Addressing this oversight, this paper explores the algorithm design based on the high-dimensional RIS information. Specifically, we first propose a RIS information reconstruction (RIS-IR) algorithm to reconstruct the high-dimensional RIS information from the low-dimensional BS information. The proposed RIS-IR algorithm comprises a data processing module for preprocessing BS information, a convolution neural network (CNN) module for feature extraction, and an output module for outputting the reconstructed RIS information. Then, we propose a transfer learning based fingerprint (TFBF) algorithm that employs the reconstructed high-dimensional RIS information for MU localization. This involves adapting a pre-trained DenseNet-121 model to map the reconstructed RIS signal to the MU's three-dimensional (3D) position. Empirical results affirm that the localization performance is significantly influenced by the high-dimensional RIS information and maintains robustness against unoptimized phase shifts.
☆ BackCom Assisted Hybrid NOMA Uplink Transmission for Ambient IoT
Hybrid non-orthogonal multiple access (H-NOMA) has recently received significant attention as a general framework of multiple access, where both conventional orthogonal multiple access (OMA) and pure NOMA are its special cases. This paper focuses on the application of H-NOMA to ambient Internet of Things (IoT) with energy-constrained devices, where a new backscatter communication (BackCom) assisted H-NOMA uplink scheme is developed. Resource allocation for H-NOMA uplink transmission is also considered, where an overall power minimization problem is formulated. Insightful understandings for the key features of BackCom assisted H-NOMA and its difference from conventional H-NOMA are illustrated by developing analytical results for the two-user special case. For the general multi-user scenario, two algorithms, one based on the branch-bound (BB) principle and the other based on successive convex approximation (SCA), are developed to realize different tradeoffs between the system performance and complexity. The numerical results are also provided to verify the accuracy of the developed analytical results and demonstrate the performance gain of H-NOMA over OMA.
☆ Safeguarding Next Generation Multiple Access Using Physical Layer Security Techniques: A Tutorial
Driven by the ever-increasing requirements of ultra-high spectral efficiency, ultra-low latency, and massive connectivity, the forefront of wireless research calls for the design of advanced next generation multiple access schemes to facilitate provisioning of these stringent demands. This inspires the embrace of non-orthogonal multiple access (NOMA) in future wireless communication networks. Nevertheless, the support of massive access via NOMA leads to additional security threats, due to the open nature of the air interface, the broadcast characteristic of radio propagation as well as intertwined relationship among paired NOMA users. To address this specific challenge, the superimposed transmission of NOMA can be explored as new opportunities for security aware design, for example, multiuser interference inherent in NOMA can be constructively engineered to benefit communication secrecy and privacy. The purpose of this tutorial is to provide a comprehensive overview on the state-of-the-art physical layer security techniques that guarantee wireless security and privacy for NOMA networks, along with the opportunities, technical challenges, and future research trends.
comment: Invited paper by Proceedings of the IEEE
☆ Power-Aware Sparse Reflect Beamforming in Active RIS-aided Interference Channels
Active reconfigurable intelligent surface (RIS) has attracted significant attention in wireless communications, due to its reflecting elements (REs) capable of reflecting incident signals with not only phase shifts but also amplitude amplifications. In this paper, we are interested in active RIS-aided interference channels in which $K$ user pairs share the same time and frequency resources with the aid of active RIS. Thanks to the promising amplitude amplification capability, activating a moderate number of REs, rather than all of them, is sufficient for the active RIS to mitigate cross-channel interferences. Motivated by this, we propose a power-aware sparse reflect beamforming design for the active RIS-aided interference channels, which allows the active RIS to flexibly adjust the number of activated REs for the sake of reducing hardware and power costs. Specifically, we establish the power consumption model in which only those activated REs consume the biasing and operation power that supports the amplitude amplification, yielding an $\ell_0$-norm power consumption function. Based on the proposed model, we investigate a sum-rate maximization problem and an active RIS power minimization problem by carefully designing the sparse reflect beamforming vector. To solve these problems, we first replace the nonconvex $\ell_0$-norm function with an iterative reweighted $\ell_1$-norm function. Then, fractional programming is used to solve the sum-rate maximization, while semidefinite programming together with the difference-of-convex algorithm (DCA) is used to solve the active RIS power minimization. Numerical results show that the proposed sparse designs can notably increase the sum rate of user pairs and decrease the power consumption of active RIS in interference channels.
☆ Unified Integrated Sensing and Communication Signal Design: A Sphere Packing Perspective
The design of communication signal sets is fundamentally a sphere packing problem. It aims to identify a set of M points in an N -dimensional space, with the objective of maximizing the separability of points that represent different bits.In contrast, signals used for sensing targets should ideally be asdeterministic as possible. This paper explores the inherent conflict and trade-off between communication and sensing when these functions are combined within the same signal set. We present a unified approach to signal design in the time, frequency, and space domains for integrated sensing and communication (ISAC), framing it as a modified sphere packing problem. Through adept formula manipulation, this problem is transformed into a large-scale quadratic constrained quadratic programming (QCQP) challenge. We propose an augmented Lagrangian and dual ascent (ALDA) algorithm for iterative problem-solving. The computational complexity of this approach is analyzed and found to be daunting for large, high-dimensional signal set designs. To address this, we introduce a bit-dimension-power splitting (BDPS) method. This method decomposes the large-scale QCQP into a series of smaller-scale problems that can be solved more efficiently and in parallel, significantly reducing the overall computational load. Extensive simulations have been conducted to validate the effectiveness of our proposed signal design methods in the context of ISAC.
comment: submitted to IEEE TCOM
☆ Next Generation Advanced Transceiver Technologies for 6G
To accommodate new applications such as extended reality, fully autonomous vehicular networks and the metaverse, next generation wireless networks are going to be subject to much more stringent performance requirements than the fifth-generation (5G) in terms of data rates, reliability, latency, and connectivity. It is thus necessary to develop next generation advanced transceiver (NGAT) technologies for efficient signal transmission and reception. In this tutorial, we explore the evolution of NGAT from three different perspectives. Specifically, we first provide an overview of new-field NGAT technology, which shifts from conventional far-field channel models to new near-field channel models. Then, three new-form NGAT technologies and their design challenges are presented, including reconfigurable intelligent surfaces, flexible antennas, and holographic multi-input multi-output (MIMO) systems. Subsequently, we discuss recent advances in semantic-aware NGAT technologies, which can utilize new metrics for advanced transceiver designs. Finally, we point out other promising transceiver technologies for future research.
comment: This paper gives a comprehensive tutorial overview of next generation advanced transceiver (NGAT) technologies for 6G
☆ Single-Carrier Delay-Doppler Domain Equalization
For doubly-selective channels, delay-Doppler (DD) modulation, mostly known as orthogonal time frequency space (OTFS) modulation, enables simultaneous compensation of delay and Doppler shifts. However, OTFS modulated signal has high peak-to-average power ratio (PAPR) because of its precoding operation performed over the DD domain. In order to deal with this problem, we propose a single-carrier transmission with delay-Doppler domain equalization (SC-DDE). In this system, the discretized time-domain SC signal is converted to the DD domain by discrete Zak transform (DZT) at the receiver side, followed by delay-Doppler domain equalization (DDE). Since equalization is performed in the DD domain, the SC-DDE receiver should acquire the channel delay-Doppler response. To this end, we introduce an embedded pilot-aided channel estimation scheme designed for SC-DDE, which does not affect the peak power property of transmitted signals. Through computer simulation, distribution of PAPR and bit error rate (BER) performance of the proposed system are compared with those of the conventional OTFS and SC with frequency-domain equalization (SC-FDE). As a result, our proposed SC-DDE significantly outperforms SC-FDE in terms of BER at the expense of additional computational complexity at the receiver. Furthermore, SC-DDE shows much lower PAPR than OTFS even though they achieve comparable coded BER performance.
comment: 12 pages, 10 figures, submitted to IEEE journal
☆ Accuracy-Aware Cooperative Sensing and Computing for Connected Autonomous Vehicles
To maintain high perception performance among connected and autonomous vehicles (CAVs), in this paper, we propose an accuracy-aware and resource-efficient raw-level cooperative sensing and computing scheme among CAVs and road-side infrastructure. The scheme enables fined-grained partial raw sensing data selection, transmission, fusion, and processing in per-object granularity, by exploiting the parallelism among object classification subtasks associated with each object. A supervised learning model is trained to capture the relationship between the object classification accuracy and the data quality of selected object sensing data, facilitating accuracy-aware sensing data selection. We formulate an optimization problem for joint sensing data selection, subtask placement and resource allocation among multiple object classification subtasks, to minimize the total resource cost while satisfying the delay and accuracy requirements. A genetic algorithm based iterative solution is proposed for the optimization problem. Simulation results demonstrate the accuracy awareness and resource efficiency achieved by the proposed cooperative sensing and computing scheme, in comparison with benchmark solutions.
☆ RadioGAT: A Joint Model-based and Data-driven Framework for Multi-band Radiomap Reconstruction via Graph Attention Networks
Multi-band radiomap reconstruction (MB-RMR) is a key component in wireless communications for tasks such as spectrum management and network planning. However, traditional machine-learning-based MB-RMR methods, which rely heavily on simulated data or complete structured ground truth, face significant deployment challenges. These challenges stem from the differences between simulated and actual data, as well as the scarcity of real-world measurements. To address these challenges, our study presents RadioGAT, a novel framework based on Graph Attention Network (GAT) tailored for MB-RMR within a single area, eliminating the need for multi-region datasets. RadioGAT innovatively merges model-based spatial-spectral correlation encoding with data-driven radiomap generalization, thus minimizing the reliance on extensive data sources. The framework begins by transforming sparse multi-band data into a graph structure through an innovative encoding strategy that leverages radio propagation models to capture the spatial-spectral correlation inherent in the data. This graph-based representation not only simplifies data handling but also enables tailored label sampling during training, significantly enhancing the framework's adaptability for deployment. Subsequently, The GAT is employed to generalize the radiomap information across various frequency bands. Extensive experiments using raytracing datasets based on real-world environments have demonstrated RadioGAT's enhanced accuracy in supervised learning settings and its robustness in semi-supervised scenarios. These results underscore RadioGAT's effectiveness and practicality for MB-RMR in environments with limited data availability.
comment: submitted to IEEE journal for possible publication
☆ SignSGD with Federated Voting
Distributed learning is commonly used for accelerating model training by harnessing the computational capabilities of multiple-edge devices. However, in practical applications, the communication delay emerges as a bottleneck due to the substantial information exchange required between workers and a central parameter server. SignSGD with majority voting (signSGD-MV) is an effective distributed learning algorithm that can significantly reduce communication costs by one-bit quantization. However, due to heterogeneous computational capabilities, it fails to converge when the mini-batch sizes differ among workers. To overcome this, we propose a novel signSGD optimizer with \textit{federated voting} (signSGD-FV). The idea of federated voting is to exploit learnable weights to perform weighted majority voting. The server learns the weights assigned to the edge devices in an online fashion based on their computational capabilities. Subsequently, these weights are employed to decode the signs of the aggregated local gradients in such a way to minimize the sign decoding error probability. We provide a unified convergence rate analysis framework applicable to scenarios where the estimated weights are known to the parameter server either perfectly or imperfectly. We demonstrate that the proposed signSGD-FV algorithm has a theoretical convergence guarantee even when edge devices use heterogeneous mini-batch sizes. Experimental results show that signSGD-FV outperforms signSGD-MV, exhibiting a faster convergence rate, especially in heterogeneous mini-batch sizes.
☆ Energy-Efficient Hybrid Beamforming with Dynamic On-off Control for Integrated Sensing, Communications, and Powering
This paper investigates the energy-efficient hybrid beamforming design for a multi-functional integrated sensing, communications, and powering (ISCAP) system. In this system, a base station (BS) with a hybrid analog-digital (HAD) architecture sends unified wireless signals to communicate with multiple information receivers (IRs), sense multiple point targets, and wirelessly charge multiple energy receivers (ERs) at the same time. To facilitate the energy-efficient design, we present a novel HAD architecture for the BS transmitter, which allows dynamic on-off control of its radio frequency (RF) chains and analog phase shifters (PSs) through a switch network. We also consider a practical and comprehensive power consumption model for the BS, by taking into account the power-dependent non-linear power amplifier (PA) efficiency, and the on-off non-transmission power consumption model of RF chains and PSs. We jointly design the hybrid beamforming and dynamic on-off control at the BS, aiming to minimize its total power consumption, while guaranteeing the performance requirements on communication rates, sensing Cram\'er-Rao bound (CRB), and harvested power levels. The formulation also takes into consideration the per-antenna transmit power constraint and the constant modulus constraints for the analog beamformer at the BS. The resulting optimization problem for ISCAP is highly non-convex. Please refer to the paper for a complete abstract.
comment: 13 pages, 6 figures, submitted to IEEE Transactions on Communications
☆ 200Gb/s VCSEL transmission using 60m OM4 MMF and KP4 FEC for AI computing clusters
We show a beyond 200Gb/s VCSEL transmission experiment. Results are based on 35GHz VCSEL and advanced DSP. We show an AIR of 245Gb/s PAM-6 back-to-back, and 200Gb/s PAM-4 over 60m OM4 fiber assuming KP4-FEC.
☆ Cache-Enabled Millimetre-Wave Fluid Antenna Systems: Modeling and Performance
This letter investigates the performance of content caching in a heterogeneous cellular network (HetNet) consisting of fluid antenna system (FAS)-equipped mobile users (MUs) and millimeter-wave (mm-wave) single-antenna small base stations (SBSs), distributed according to the independent homogeneous Poisson point processes (HPPP). In particular, it is assumed that the most popular contents are cached in the SBSs to serve the FAS-equipped MUs requests. To assess the system performance, we derive compact expressions for the successful content delivery probability (SCDP) and the content delivery delay (CDD) using the Gauss-Laguerre quadrature technique. Our numerical results show that the performance of cache-enabled mm-wave HetNets can be greatly improved, when the FAS is utilized at the MUs instead of traditional fixed-antenna system deployment.
☆ Latency-Aware Generative Semantic Communications with Pre-Trained Diffusion Models
Generative foundation AI models have recently shown great success in synthesizing natural signals with high perceptual quality using only textual prompts and conditioning signals to guide the generation process. This enables semantic communications at extremely low data rates in future wireless networks. In this paper, we develop a latency-aware semantic communications framework with pre-trained generative models. The transmitter performs multi-modal semantic decomposition on the input signal and transmits each semantic stream with the appropriate coding and communication schemes based on the intent. For the prompt, we adopt a re-transmission-based scheme to ensure reliable transmission, and for the other semantic modalities we use an adaptive modulation/coding scheme to achieve robustness to the changing wireless channel. Furthermore, we design a semantic and latency-aware scheme to allocate transmission power to different semantic modalities based on their importance subjected to semantic quality constraints. At the receiver, a pre-trained generative model synthesizes a high fidelity signal using the received multi-stream semantics. Simulation results demonstrate ultra-low-rate, low-latency, and channel-adaptive semantic communications.
☆ Physics-compliant diagonal representation of beyond-diagonal RIS
Physics-compliant models of RIS-parametrized channels assign a load-terminated port to each RIS element. For conventional diagonal RIS (D-RIS), each auxiliary port is terminated by its own independent and individually tunable load (i.e., independent of the other auxiliary ports). For beyond-diagonal RIS (BD-RIS), the auxiliary ports are terminated by a tunable load circuit which couples the auxiliary ports to each other. Here, we point out that a physics-compliant model of the load circuit of a BD-RIS takes the same form as a physics-compliant model of a D-RIS-parametrized radio environment: a multi-port network with a subset of ports terminated by individually tunable loads (independent of each other). Consequently, we recognize that a BD-RIS-parametrized radio environment can be understood as a multi-port cascade network (i.e., the cascade of radio environment with load circuit) terminated by individually tunable loads (independent of each other). Hence, the BD-RIS problem can be mapped into the original D-RIS problem by replacing the radio environment with the cascade of radio environment and load circuit. The insight that BD-RIS can be physics-compliantly analyzed with the conventional D-RIS formalism implies that (i) the same optimization protocols as for D-RIS can be used for the BD-RIS case, and (ii) it is unclear if existing comparisons between BD-RIS and D-RIS are fair because for a fixed number of RIS elements, a BD-RIS has usually more tunable lumped elements.
comment: 5 pages, 2 figures, submitted to an IEEE Conference
☆ On the Intersection of Signal Processing and Machine Learning: A Use Case-Driven Analysis Approach
Recent advancements in sensing, measurement, and computing technologies have significantly expanded the potential for signal-based applications, leveraging the synergy between signal processing and Machine Learning (ML) to improve both performance and reliability. This fusion represents a critical point in the evolution of signal-based systems, highlighting the need to bridge the existing knowledge gap between these two interdisciplinary fields. Despite many attempts in the existing literature to bridge this gap, most are limited to specific applications and focus mainly on feature extraction, often assuming extensive prior knowledge in signal processing. This assumption creates a significant obstacle for a wide range of readers. To address these challenges, this paper takes an integrated article approach. It begins with a detailed tutorial on the fundamentals of signal processing, providing the reader with the necessary background knowledge. Following this, it explores the key stages of a standard signal processing-based ML pipeline, offering an in-depth review of feature extraction techniques, their inherent challenges, and solutions. Differing from existing literature, this work offers an application-independent review and introduces a novel classification taxonomy for feature extraction techniques. Furthermore, it aims at linking theoretical concepts with practical applications, and demonstrates this through two specific use cases: a spectral-based method for condition monitoring of rolling bearings and a wavelet energy analysis for epilepsy detection using EEG signals. In addition to theoretical contributions, this work promotes a collaborative research culture by providing a public repository of relevant Python and MATLAB signal processing codes. This effort is intended to support collaborative research efforts and ensure the reproducibility of the results presented.
☆ 6D Movable Antenna Enhanced Wireless Network Via Discrete Position and Rotation Optimization
Six-dimensional movable antenna (6DMA) is an effective approach to improve wireless network capacity by adjusting the 3D positions and 3D rotations of distributed antenna surfaces based on the users' spatial distribution and statistical channel information. Although continuously positioning/rotating 6DMA surfaces can achieve the greatest flexibility and thus the highest capacity improvement, it is difficult to implement due to the discrete movement constraints of practical stepper motors. Thus, in this paper, we consider a 6DMA-aided base station (BS) with only a finite number of possible discrete positions and rotations for the 6DMA surfaces. We aim to maximize the average network capacity for random numbers of users at random locations by jointly optimizing the 3D positions and 3D rotations of multiple 6DMA surfaces at the BS subject to discrete movement constraints. In particular, we consider the practical cases with and without statistical channel knowledge of the users, and propose corresponding offline and online optimization algorithms, by leveraging the Monte Carlo and conditional sample mean (CSM) methods, respectively. Simulation results verify the effectiveness of our proposed offline and online algorithms for discrete position/rotation optimization of 6DMA surfaces as compared to various benchmark schemes with fixed-position antennas (FPAs) and 6DMAs with limited movability. It is shown that 6DMA-BS can significantly enhance wireless network capacity, even under discrete position/rotation constraints, by exploiting the spatial distribution characteristics of the users.
comment: 13 pages, double column
☆ Efficient generation of realistic guided wave signals for reliability estimation
Across non-destructive testing (NDT) and structural health monitoring (SHM), accurate knowledge of the systems' reliability for detecting defects, such as Probability of Detection (POD) analysis is essential to enabling widespread adoption. Traditionally this relies on access to extensive experimental data to cover all critical areas of the parametric space, which becomes expensive, and heavily undermines the benefit such systems bring. In response to these challenges, reliability estimation based on numerical simulation emerges as a practical solution, offering enhanced efficiency and cost-effectiveness. Nevertheless, precise reliability estimation demands that the simulated data faithfully represents the real-world performance. In this context, a numerical framework tailored to generate realistic signals for reliability estimation purposes is presented here, focusing on the application of guided wave SHM for pipe monitoring. It specifically incorporates key characteristics of real signals: random noise and coherent noise caused by the imbalance in transducer performance within guided wave monitoring systems. The effectiveness of our proposed methodology is demonstrated through a comprehensive comparative analysis between simulation-generated signals and experimental signals both individually and statistically. Furthermore, to assess the reliability of a guided wave system in terms of the inspection range for pipe monitoring, a series of POD analyses using simulation-generated data were conducted. The comparison of POD curves derived from ideal and realistic simulation data underscores the necessity of considering coherent noise for accurate POD curve calculations. Moreover, the POD analysis based on realistic simulation-generated data provides a quantitative estimation of the inspection range with more details compared to the current industry practice.
☆ Enhancing UAV Security Through Zero Trust Architecture: An Advanced Deep Learning and Explainable AI Analysis
In the dynamic and ever-changing domain of Unmanned Aerial Vehicles (UAVs), the utmost importance lies in guaranteeing resilient and lucid security measures. This study highlights the necessity of implementing a Zero Trust Architecture (ZTA) to enhance the security of unmanned aerial vehicles (UAVs), hence departing from conventional perimeter defences that may expose vulnerabilities. The Zero Trust Architecture (ZTA) paradigm requires a rigorous and continuous process of authenticating all network entities and communications. The accuracy of our methodology in detecting and identifying unmanned aerial vehicles (UAVs) is 84.59\%. This is achieved by utilizing Radio Frequency (RF) signals within a Deep Learning framework, a unique method. Precise identification is crucial in Zero Trust Architecture (ZTA), as it determines network access. In addition, the use of eXplainable Artificial Intelligence (XAI) tools such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) contributes to the improvement of the model's transparency and interpretability. Adherence to Zero Trust Architecture (ZTA) standards guarantees that the classifications of unmanned aerial vehicles (UAVs) are verifiable and comprehensible, enhancing security within the UAV field.
comment: 6 pages, 5 figures
☆ Backpropagation through space, time, and the brain
Effective learning in neuronal networks requires the adaptation of individual synapses given their relative contribution to solving a task. However, physical neuronal systems -- whether biological or artificial -- are constrained by spatio-temporal locality. How such networks can perform efficient credit assignment, remains, to a large extent, an open question. In Machine Learning, the answer is almost universally given by the error backpropagation algorithm, through both space (BP) and time (BPTT). However, BP(TT) is well-known to rely on biologically implausible assumptions, in particular with respect to spatiotemporal (non-)locality, while forward-propagation models such as real-time recurrent learning (RTRL) suffer from prohibitive memory constraints. We introduce Generalized Latent Equilibrium (GLE), a computational framework for fully local spatio-temporal credit assignment in physical, dynamical networks of neurons. We start by defining an energy based on neuron-local mismatches, from which we derive both neuronal dynamics via stationarity and parameter dynamics via gradient descent. The resulting dynamics can be interpreted as a real-time, biologically plausible approximation of BPTT in deep cortical networks with continuous-time neuronal dynamics and continuously active, local synaptic plasticity. In particular, GLE exploits the ability of biological neurons to phase-shift their output rate with respect to their membrane potential, which is essential in both directions of information propagation. For the forward computation, it enables the mapping of time-continuous inputs to neuronal space, performing an effective spatiotemporal convolution. For the backward computation, it permits the temporal inversion of feedback signals, which consequently approximate the adjoint states necessary for useful parameter updates.
comment: 15 pages, 7 figures
☆ Provably Robust Score-Based Diffusion Posterior Sampling for Plug-and-Play Image Reconstruction
In a great number of tasks in science and engineering, the goal is to infer an unknown image from a small number of measurements collected from a known forward model describing certain sensing or imaging modality. Due to resource constraints, this task is often extremely ill-posed, which necessitates the adoption of expressive prior information to regularize the solution space. Score-based diffusion models, due to its impressive empirical success, have emerged as an appealing candidate of an expressive prior in image reconstruction. In order to accommodate diverse tasks at once, it is of great interest to develop efficient, consistent and robust algorithms that incorporate {\em unconditional} score functions of an image prior distribution in conjunction with flexible choices of forward models. This work develops an algorithmic framework for employing score-based diffusion models as an expressive data prior in general nonlinear inverse problems. Motivated by the plug-and-play framework in the imaging community, we introduce a diffusion plug-and-play method (\textsf{DPnP}) that alternatively calls two samplers, a proximal consistency sampler based solely on the likelihood function of the forward model, and a denoising diffusion sampler based solely on the score functions of the image prior. The key insight is that denoising under white Gaussian noise can be solved {\em rigorously} via both stochastic (i.e., DDPM-type) and deterministic (i.e., DDIM-type) samplers using the unconditional score functions. We establish both asymptotic and non-asymptotic performance guarantees of \textsf{DPnP}, and provide numerical experiments to illustrate its promise in solving both linear and nonlinear image reconstruction tasks. To the best of our knowledge, \textsf{DPnP} is the first provably-robust posterior sampling method for nonlinear inverse problems using unconditional diffusion priors.
☆ Movable-Antenna Position Optimization: A Graph-based Approach
Fluid antennas (FAs) and movable antennas (MAs) have emerged as promising technologies in wireless communications, which offer the flexibility to improve channel conditions by adjusting transmit/receive antenna positions within a spatial region. In this letter, we focus on an MA-enhanced multiple-input single-output (MISO) communication system, aiming to optimize the positions of multiple transmit MAs to maximize the received signal power. Unlike the prior works on continuously searching for the optimal MA positions, we propose to sample the transmit region into discrete points, such that the continuous antenna position optimization problem is transformed to a discrete sampling point selection problem based on the point-wise channel information. However, such a point selection problem is combinatory and challenging to be optimally solved. To tackle this challenge, we ingeniously recast it as an equivalent fixed-hop shortest path problem in graph theory and propose a customized algorithm to solve it optimally in polynomial time. To further reduce the complexity, a linear-time sequential update algorithm is also proposed to obtain a high-quality suboptimal solution. Numerical results demonstrate that the proposed algorithms can yield considerable performance gains over the conventional fixed-position antennas with/without antenna selection.
comment: 5 pages, 6 figures. We propose a graph-based algorithm that is able to optimally solve the fluid-/movable-antenna position optimization problem in polynomial time
☆ Proprioception Is All You Need: Terrain Classification for Boreal Forests IROS 2024
Recent works in field robotics highlighted the importance of resiliency against different types of terrains. Boreal forests, in particular, are home to many mobility-impeding terrains that should be considered for off-road autonomous navigation. Also, being one of the largest land biomes on Earth, boreal forests are an area where autonomous vehicles are expected to become increasingly common. In this paper, we address this issue by introducing BorealTC, a publicly available dataset for proprioceptive-based terrain classification (TC). Recorded with a Husky A200, our dataset contains 116 min of Inertial Measurement Unit (IMU), motor current, and wheel odometry data, focusing on typical boreal forest terrains, notably snow, ice, and silty loam. Combining our dataset with another dataset from the state-of-the-art, we evaluate both a Convolutional Neural Network (CNN) and the novel state space model (SSM)-based Mamba architecture on a TC task. Interestingly, we show that while CNN outperforms Mamba on each separate dataset, Mamba achieves greater accuracy when trained on a combination of both. In addition, we demonstrate that Mamba's learning capacity is greater than a CNN for increasing amounts of data. We show that the combination of two TC datasets yields a latent space that can be interpreted with the properties of the terrains. We also discuss the implications of merging datasets on classification. Our source code and dataset are publicly available online: https://github.com/norlab-ulaval/BorealTC.
comment: Submitted to the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
☆ Exploring Communication Technologies, Standards, and Challenges in Electrified Vehicle Charging
As public awareness of environmental protection continues to grow, the trend of integrating more electric vehicles (EVs) into the transportation sector is rising. Unlike conventional internal combustion engine (ICE) vehicles, EVs can minimize carbon emissions and potentially achieve autonomous driving. However, several obstacles hinder the widespread adoption of EVs, such as their constrained driving range and the extended time required for charging. One alternative solution to address these challenges is implementing dynamic wireless power transfer (DWPT), charging EVs in motion on the road. Moreover, charging stations with static wireless power transfer (SWPT) infrastructure can replace existing gas stations, enabling users to charge EVs in parking lots or at home. This paper surveys the communication infrastructure for static and dynamic wireless charging in electric vehicles. It encompasses all communication aspects involved in the wireless charging process. The architecture and communication requirements for static and dynamic wireless charging are presented separately. Additionally, a comprehensive comparison of existing communication standards is provided. The communication with the grid is also explored in detail. The survey gives attention to security and privacy issues arising during communications. In summary, the paper addresses the challenges and outlines upcoming trends in communication for EV wireless charging.
comment: submitted to IET Communication as a survey paper
☆ Two-Dimensional Frequency-Difference-of-Arrival Varieties
This paper studies Frequency-Difference-of-Arrival (FDOA) curves for the 2-dimensional, 2-sensor case. The primary focus of this paper is to give a description of curves associated to the FDOA problem from the algebro-geometric point of view. To be more precise, the complex projective picture of the family of FDOA curves for all possible relative velocities is described.
comment: 64 pages, 10 figures
☆ RKFNet: A Novel Neural Network Aided Robust Kalman Filter
Driven by the filtering challenges in linear systems disturbed by non-Gaussian heavy-tailed noise, the robust Kalman filters (RKFs) leveraging diverse heavy-tailed distributions have been introduced. However, the RKFs rely on precise noise models, and large model errors can degrade their filtering performance. Also, the posterior approximation by the employed variational Bayesian (VB) method can further decrease the estimation precision. Here, we introduce an innovative RKF method, the RKFNet, which combines the heavy-tailed-distribution-based RKF framework with the deep learning (DL) technique and eliminates the need for the precise parameters of the heavy-tailed distributions. To reduce the VB approximation error, the mixing-parameter-based function and the scale matrix are estimated by the incorporated neural network structures. Also, the stable training process is achieved by our proposed unsupervised scheduled sampling (USS) method, where a loss function based on the Student's t (ST) distribution is utilised to overcome the disturbance of the noise outliers and the filtering results of the traditional RKFs are employed as reference sequences. Furthermore, the RKFNet is evaluated against various RKFs and recurrent neural networks (RNNs) under three kinds of heavy-tailed measurement noises, and the simulation results showcase its efficacy in terms of estimation accuracy and efficiency.
☆ Resolution Limit of Single-Photon LiDAR
Single-photon Light Detection and Ranging (LiDAR) systems are often equipped with an array of detectors for improved spatial resolution and sensing speed. However, given a fixed amount of flux produced by the laser transmitter across the scene, the per-pixel Signal-to-Noise Ratio (SNR) will decrease when more pixels are packed in a unit space. This presents a fundamental trade-off between the spatial resolution of the sensor array and the SNR received at each pixel. Theoretical characterization of this fundamental limit is explored. By deriving the photon arrival statistics and introducing a series of new approximation techniques, the Mean Squared Error (MSE) of the maximum-likelihood estimator of the time delay is derived. The theoretical predictions align well with simulations and real data.
♻ ☆ Computation-Limited Signals: A Channel Capacity Regime Constrained by Computational Complexity
In this letter, we introduce the computational-limited (comp-limited) signals, a communication capacity regime in which the signal time computational complexity overhead is the key constraint -- rather than power or bandwidth -- to the overall communication capacity. We present the Spectro-Computational (SC) analysis, a novel mathematical framework that enhances classic concepts of information theory -- such as throughput, spectral efficiency and capacity -- to account for the signal processing computational complexity overhead. We consider a specific Shannon regime under which capacity is expected to get arbitrarily large as channel resources grow. Under that regime, we identify the conditions under which the time complexity overhead causes capacity to decrease rather than increasing, thereby creating the case for the comp-limited regime. We also provide examples of the SC analysis and show the OFDM waveform is comp-limited unless the lower-bound computational complexity of the $N$-point DFT problem verifies as $\Omega(N)$, which remains an open challenge.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Performance Analysis of In-Band-Full-Duplex Multi-Cell Wideband IAB Networks
This paper analyzes the performance of the 3rd Generation Partnership Project (3GPP)-inspired multi-cell wideband single-hop backhaul millimeter-wave-in-band-full-duplex (IBFD)-integrated access and backhaul (IAB) networks by using stochastic geometry. We model the wired-connected Next Generation NodeBs (gNBs) as the Mat\'ern hard-core point process (MHCPP) to meet the real-world deployment requirement and reduce the cost caused by wired connection in the network. We first derive association probabilities that reflect how likely the typical user-equipment is served by a gNB or an IAB-node based on the maximum long-term averaged biased-received-desired-signal power criteria. Further, by leveraging the composite Gamma-Lognormal distribution, we derive the closed-form signal to interference plus noise ratio coverage, capacity with outage, and ergodic capacity of the network. In order to avoid underestimating the noise, we consider the sidelobe gain on inter-cell interference links and the analog to digital converter quantization noise. Compared with the half-duplex transmission, numerical results show an enhanced capacity with outage and ergodic capacity provided by IBFD under successful self-interference cancellation. We also study how the power bias and density ratio of the IAB-node to gNB, and the hard-core distance can affect system performances.
♻ ☆ Probabilistic positioning via ray tracing with noisy angle of arrival measurements
This paper investigates the problems of interference prediction and sensing for efficient spectrum access and link adaptation. The considered approach for interference prediction relies on a parametric model. However, we assume that the number of observations available to learn theses parameters is limited. This implies that they should be treated as random variables rather than fixed values. We show how this can impact the spectrum access and link adaptation strategies. We also introduce the notion of "interferer-coherence time" to establish the number of independent interferer state realizations experienced by a codeword. We explain how it can be computed taking into account the model uncertainty and how this impacts the link adaptation.
comment: Submitted to IEEE Communications Letters
♻ ☆ Analysis and Compensation of Carrier Frequency Offset Impairments in Unique Word OFDM
Unique Word-orthogonal frequency division multiplexing (UW-OFDM) is known to provide various performance benefits over conventional cyclic prefix (CP) based OFDM. Most important, UW-OFDM features excellent spectral sidelobe suppression properties and an outstanding bit error ratio (BER) performance. Carrier frequency offset (CFO) induced impairments denote a challenging task for OFDM systems of any kind. In this work we investigate the CFO effects on UW-OFDM and compare it to conventional multi-carrier and single-carrier systems. Different CFO compensation approaches with different computational complexity are considered throughout this work and assessed against each other. A mean squared error analysis carried out after data estimation reveals a significant higher robustness of UW-OFDM over CP-OFDM against CFO effects. Additionally, the conducted BER simulations generally support this conclusion for various scenarios, ranging from uncoded to coded transmission in a frequency selective environment.
♻ ☆ Asymptotic Error Rates for Point Process Classification
Point processes are finding growing applications in numerous fields, such as neuroscience, high frequency finance and social media. So classic problems of classification and clustering are of increasing interest. However, analytic study of misclassification error probability in multi-class classification has barely begun. In this paper, we tackle the multi-class likelihood classification problem for point processes and develop, for the first time, both asymptotic upper and lower bounds on the error rate in terms of computable pair-wise affinities. We apply these general results to classifying renewal processes. Under some technical conditions, we show that the bounds have exponential decay and give explicit associated constants. The results are illustrated with a non-trivial simulation.
comment: 25 pages, 3 figures
♻ ☆ Federated Learning Using Three-Operator ADMM
Federated learning (FL) has emerged as an instance of distributed machine learning paradigm that avoids the transmission of data generated on the users' side. Although data are not transmitted, edge devices have to deal with limited communication bandwidths, data heterogeneity, and straggler effects due to the limited computational resources of users' devices. A prominent approach to overcome such difficulties is FedADMM, which is based on the classical two-operator consensus alternating direction method of multipliers (ADMM). The common assumption of FL algorithms, including FedADMM, is that they learn a global model using data only on the users' side and not on the edge server. However, in edge learning, the server is expected to be near the base station and have direct access to rich datasets. In this paper, we argue that leveraging the rich data on the edge server is much more beneficial than utilizing only user datasets. Specifically, we show that the mere application of FL with an additional virtual user node representing the data on the edge server is inefficient. We propose FedTOP-ADMM, which generalizes FedADMM and is based on a three-operator ADMM-type technique that exploits a smooth cost function on the edge server to learn a global model parallel to the edge devices. Our numerical experiments indicate that FedTOP-ADMM has substantial gain up to 33\% in communication efficiency to reach a desired test accuracy with respect to FedADMM, including a virtual user on the edge server.
comment: accepted to IEEE Journal of Selected Topics in Signal Processing, 2022
♻ ☆ Finite-Precision Arithmetic Transceiver for Massive MIMO Systems
Efficient implementation of massive multiple-input-multiple-output (MIMO) transceivers is essential for the next-generation wireless networks. To reduce the high computational complexity of the massive MIMO transceiver, in this paper, we propose a new massive MIMO architecture using finite-precision arithmetic. First, we conduct the rounding error analysis and derive the lower bound of the achievable rate for single-input-multiple-output (SIMO) using maximal ratio combining (MRC) and multiple-input-single-output (MISO) systems using maximal ratio transmission (MRT) with finite-precision arithmetic. Then, considering the multi-user scenario, the rounding error analysis of zero-forcing (ZF) detection and precoding is derived by using the normal equations (NE) method. The corresponding lower bounds of the achievable sum rate are also derived and asymptotic analyses are presented. Built upon insights from these analyses and lower bounds, we propose a mixed-precision architecture for massive MIMO systems to offset performance gaps due to finite-precision arithmetic. The corresponding analysis of rounding errors and computational costs is obtained. Simulation results validate the derived bounds and underscore the superiority of the proposed mixed-precision architecture to the conventional structure.
comment: 16 pages, 8 figures. Submitted to IEEE JSAC for possible publication
Sound
☆ Distributed collaborative anomalous sound detection by embedding sharing
To develop a machine sound monitoring system, a method for detecting anomalous sound is proposed. In this paper, we explore a method for multiple clients to collaboratively learn an anomalous sound detection model while keeping their raw data private from each other. In the context of industrial machine anomalous sound detection, each client possesses data from different machines or different operational states, making it challenging to learn through federated learning or split learning. In our proposed method, each client calculates embeddings using a common pre-trained model developed for sound data classification, and these calculated embeddings are aggregated on the server to perform anomalous sound detection through outlier exposure. Experiments showed that our proposed method improves the AUC of anomalous sound detection by an average of 6.8%.
☆ Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator ICASSP 2024
A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because of its fast, lightweight, and high-quality characteristics. However, this data-driven model requires a large amount of training data incurring high data-collection costs. This fact motivates us to train a GAN-based vocoder on limited data. A promising solution is to augment the training data to avoid overfitting. However, a standard discriminator is unconditional and insensitive to distributional changes caused by data augmentation. Thus, augmented speech (which can be extraordinary) may be considered real speech. To address this issue, we propose an augmentation-conditional discriminator (AugCondD) that receives the augmentation state as input in addition to speech, thereby assessing the input speech according to the augmentation state, without inhibiting the learning of the original non-augmented distribution. Experimental results indicate that AugCondD improves speech quality under limited data conditions while achieving comparable speech quality under sufficient data conditions. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/.
comment: Accepted to ICASSP 2024. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/
☆ VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS, our model outperforms prior SotA models including VALLE and the popular commercial model XTTS-v2. Crucially, the models are evaluated on challenging and realistic datasets, that consist of diverse accents, speaking styles, recording conditions, and background noise and music, and our model performs consistently well compared to other models and real recordings. In particular, for speech editing evaluation, we introduce a high quality, challenging, and realistic dataset named RealEdit. We encourage readers to listen to the demos at https://jasonppy.github.io/VoiceCraft_web.
comment: Data, code, and model weights are available at https://github.com/jasonppy/VoiceCraft
♻ ☆ BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network ICASSP 2024
Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.
comment: Accepted at ICASSP 2024. Equation (5) in the previous version is wrong. We modified it
♻ ☆ Generative Pre-training for Speech with Flow Matching ICLR 2024
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training.
comment: ICLR 2024
♻ ☆ A unified front-end framework for English text-to-speech synthesis ICASSP 2024
The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes. The English TTS front-end typically consists of a text normalization (TN) module, a prosody word prosody phrase (PWPP) module, and a grapheme-to-phoneme (G2P) module. However, current research on the English TTS front-end focuses solely on individual modules, neglecting the interdependence between them and resulting in sub-optimal performance for each module. Therefore, this paper proposes a unified front-end framework that captures the dependencies among the English TTS front-end modules. Extensive experiments have demonstrated that the proposed method achieves state-of-the-art (SOTA) performance in all modules.
comment: Accepted in ICASSP 2024
Robotics
☆ A Robotic Skill Learning System Built Upon Diffusion Policies and Foundation Models
In this paper, we build upon two major recent developments in the field, Diffusion Policies for visuomotor manipulation and large pre-trained multimodal foundational models to obtain a robotic skill learning system. The system can obtain new skills via the behavioral cloning approach of visuomotor diffusion policies given teleoperated demonstrations. Foundational models are being used to perform skill selection given the user's prompt in natural language. Before executing a skill the foundational model performs a precondition check given an observation of the workspace. We compare the performance of different foundational models to this end as well as give a detailed experimental evaluation of the skills taught by the user in simulation and the real world. Finally, we showcase the combined system on a challenging food serving scenario in the real world. Videos of all experimental executions, as well as the process of teaching new skills in simulation and the real world, are available on the project's website.
comment: https://roboskillframework.github.io
☆ BatDeck: Advancing Nano-drone Navigation with Low-power Ultrasound-based Obstacle Avoidance
Nano-drones, distinguished by their agility, minimal weight, and cost-effectiveness, are particularly well-suited for exploration in confined, cluttered and narrow spaces. Recognizing transparent, highly reflective or absorbing materials, such as glass and metallic surfaces is challenging, as classical sensors, such as cameras or laser rangers, often do not detect them. Inspired by bats, which can fly at high speeds in complete darkness with the help of ultrasound, this paper introduces \textit{BatDeck}, a pioneering sensor-deck employing a lightweight and low-power ultrasonic sensor for nano-drone autonomous navigation. This paper first provides insights about sensor characteristics, highlighting the influence of motor noise on the ultrasound readings, then it introduces the results of extensive experimental tests for obstacle avoidance (OA) in a diverse environment. Results show that \textit{BatDeck} allows exploration for a flight time of 8 minutes while covering 136m on average before crash in a challenging environment with transparent and reflective obstacles, proving the effectiveness of ultrasonic sensors for OA on nano-drones.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Synapse: Learning Preferential Concepts from Visual Demonstrations
This paper addresses the problem of preference learning, which aims to learn user-specific preferences (e.g., "good parking spot", "convenient drop-off location") from visual input. Despite its similarity to learning factual concepts (e.g., "red cube"), preference learning is a fundamentally harder problem due to its subjective nature and the paucity of person-specific training data. We address this problem using a new framework called Synapse, which is a neuro-symbolic approach designed to efficiently learn preferential concepts from limited demonstrations. Synapse represents preferences as neuro-symbolic programs in a domain-specific language (DSL) that operates over images, and leverages a novel combination of visual parsing, large language models, and program synthesis to learn programs representing individual preferences. We evaluate Synapse through extensive experimentation including a user case study focusing on mobility-related concepts in mobile robotics and autonomous driving. Our evaluation demonstrates that Synapse significantly outperforms existing baselines as well as its own ablations. The code and other details can be found on the project website https://amrl.cs.utexas.edu/synapse .
comment: 23 pages, 7 figures; Preprint
☆ Domain Adaptive Detection of MAVs: A Benchmark and Noise Suppression Network
Visual detection of Micro Air Vehicles (MAVs) has attracted increasing attention in recent years due to its important application in various tasks. The existing methods for MAV detection assume that the training set and testing set have the same distribution. As a result, when deployed in new domains, the detectors would have a significant performance degradation due to domain discrepancy. In this paper, we study the problem of cross-domain MAV detection. The contributions of this paper are threefold. 1) We propose a Multi-MAV-Multi-Domain (M3D) dataset consisting of both simulation and realistic images. Compared to other existing datasets, the proposed one is more comprehensive in the sense that it covers rich scenes, diverse MAV types, and various viewing angles. A new benchmark for cross-domain MAV detection is proposed based on the proposed dataset. 2) We propose a Noise Suppression Network (NSN) based on the framework of pseudo-labeling and a large-to-small training procedure. To reduce the challenging pseudo-label noises, two novel modules are designed in this network. The first is a prior-based curriculum learning module for allocating adaptive thresholds for pseudo labels with different difficulties. The second is a masked copy-paste augmentation module for pasting truly-labeled MAVs on unlabeled target images and thus decreasing pseudo-label noises. 3) Extensive experimental results verify the superior performance of the proposed method compared to the state-of-the-art ones. In particular, it achieves mAP of 46.9%(+5.8%), 50.5%(+3.7%), and 61.5%(+11.3%) on the tasks of simulation-to-real adaptation, cross-scene adaptation, and cross-camera adaptation, respectively.
comment: 17 pages, 11 figures. Accepted by IEEE Transactions on Automation Science and Engineering
☆ Skill Q-Network: Learning Adaptive Skill Ensemble for Mapless Navigation in Unknown Environments
This paper focuses on the acquisition of mapless navigation skills within unknown environments. We introduce the Skill Q-Network (SQN), a novel reinforcement learning method featuring an adaptive skill ensemble mechanism. Unlike existing methods, our model concurrently learns a high-level skill decision process alongside multiple low-level navigation skills, all without the need for prior knowledge. Leveraging a tailored reward function for mapless navigation, the SQN is capable of learning adaptive maneuvers that incorporate both exploration and goal-directed skills, enabling effective navigation in new environments. Our experiments demonstrate that our SQN can effectively navigate complex environments, exhibiting a 40% higher performance compared to baseline models. Without explicit guidance, SQN discovers how to combine low-level skill policies, showcasing both goal-directed navigations to reach destinations and exploration maneuvers to escape from local minimum regions in challenging scenarios. Remarkably, our adaptive skill ensemble method enables zero-shot transfer to out-of-distribution domains, characterized by unseen observations from non-convex obstacles or uneven, subterranean-like environments.
comment: 8 pages, 8 figures
☆ Trajectory Planning of Robotic Manipulator in Dynamic Environment Exploiting DRL
This study is about the implementation of a reinforcement learning algorithm in the trajectory planning of manipulators. We have a 7-DOF robotic arm to pick and place the randomly placed block at a random target point in an unknown environment. The obstacle is randomly moving which creates a hurdle in picking the object. The objective of the robot is to avoid the obstacle and pick the block with constraints to a fixed timestamp. In this literature, we have applied a deep deterministic policy gradient (DDPG) algorithm and compared the model's efficiency with dense and sparse rewards.
comment: Accepted in ICIESTR-2024
☆ Bridging the Sim-to-Real Gap with Bayesian Inference
We present SIM-FSVGD for learning robot dynamics from data. As opposed to traditional methods, SIM-FSVGD leverages low-fidelity physical priors, e.g., in the form of simulators, to regularize the training of neural network models. While learning accurate dynamics already in the low data regime, SIM-FSVGD scales and excels also when more data is available. We empirically show that learning with implicit physical priors results in accurate mean model estimation as well as precise uncertainty quantification. We demonstrate the effectiveness of SIM-FSVGD in bridging the sim-to-real gap on a high-performance RC racecar system. Using model-based RL, we demonstrate a highly dynamic parking maneuver with drifting, using less than half the data compared to the state of the art.
☆ Symbolic and User-friendly Geometric Algebra Routines (SUGAR) for Computations in Matlab
Geometric algebra (GA) is a mathematical tool for geometric computing, providing a framework that allows a unified and compact approach to geometric relations which in other mathematical systems are typically described using different more complicated elements. This fact has led to an increasing adoption of GA in applied mathematics and engineering problems. However, the scarcity of symbolic implementations of GA and its inherent complexity, requiring a specific mathematical background, make it challenging and less intuitive for engineers to work with. This prevents wider adoption among more applied professionals. To address this challenge, this paper introduces SUGAR (Symbolic and User-friendly Geometric Algebra Routines), an open-source toolbox designed for Matlab and licensed under the MIT License. SUGAR facilitates the translation of GA concepts into Matlab and provides a collection of user-friendly functions tailored for GA computations, including support for symbolic operations. It supports both numeric and symbolic computations in high-dimensional GAs. Specifically tailored for applied mathematics and engineering applications, SUGAR has been meticulously engineered to represent geometric elements and transformations within two and three-dimensional projective and conformal geometric algebras, aligning with established computational methodologies in the literature. Furthermore, SUGAR efficiently handles functions of multivectors, such as exponential, logarithmic, sinusoidal, and cosine functions, enhancing its applicability across various engineering domains, including robotics, control systems, and power electronics. Finally, this work includes four distinct validation examples, demonstrating SUGAR's capabilities across the above-mentioned fields and its practical utility in addressing real-world applied mathematics and engineering problems.
comment: 33 pages, 6 figures, journal paper submitted to ACM TOMS
☆ Technical Development of a Semi-Autonomous Robotic Partition
This technical description details the design and engineering process of a semi-autonomous robotic partition. This robotic partition prototype was subsequently employed in a longer-term evaluation in-the-wild study conducted by the authors in a real-world office setting.
☆ ROXIE: Defining a Robotic eXplanation and Interpretability Engine IROS 2024
In an era where autonomous robots increasingly inhabit public spaces, the imperative for transparency and interpretability in their decision-making processes becomes paramount. This paper presents the overview of a Robotic eXplanation and Interpretability Engine (ROXIE), which addresses this critical need, aiming to demystify the opaque nature of complex robotic behaviors. This paper elucidates the key features and requirements needed for providing information and explanations about robot decision-making processes. It also overviews the suite of software components and libraries available for deployment with ROS 2, empowering users to provide comprehensive explanations and interpretations of robot processes and behaviors, thereby fostering trust and collaboration in human-robot interactions.
comment: 7 pages, 3 figures, 1 tables, Submitted to IROS 2024
☆ Research Challenges for Adaptive Architecture: Empowering Occupants of Multi-Occupancy Buildings
This positional paper outlines our vision of 'adaptive architecture', which involves the integration of robotic technology to physically change an architectural space in supporting the changing needs of its occupants, in response to the CHI'24 workshop "HabiTech - Inhabiting Buildings, Data & Technology" call on "How do new technologies enable and empower the inhabitants of multi-occupancy buildings?". Specifically, while adaptive architecture holds promise for enhancing occupant satisfaction, comfort, and overall health and well-being, there remains a range of research challenges of (1) how it can effectively support individual occupants, while (2) mediating the conflicting needs of collocated others, and (3) integrating meaningfully into the sociocultural characteristics of their building community.
☆ The Adaptive Workplace: Orchestrating Architectural Services around the Wellbeing of Individual Occupants
As the academic consortia members of the EU Horizon project SONATA ("Situation-aware OrchestratioN of AdapTive Architecture"), we respond to the workshop call for "Office Wellbeing by Design: Don't Stand for Anything Less" by proposing the "Adaptive Workplace" concept. In essence, our vision aims to adapt a workplace to the ever-changing needs of individual occupants, instead of that occupants are expected to adapt to their workplace.
☆ Counter-example guided Imitation Learning of Feedback Controllers from Temporal Logic Specifications
We present a novel method for imitation learning for control requirements expressed using Signal Temporal Logic (STL). More concretely we focus on the problem of training a neural network to imitate a complex controller. The learning process is guided by efficient data aggregation based on counter-examples and a coverage measure. Moreover, we introduce a method to evaluate the performance of the learned controller via parameterization and parameter estimation of the STL requirements. We demonstrate our approach with a flying robot case study.
☆ Active Admittance Control with Iterative Learning for General-Purpose Contact-Rich Manipulation
Force interaction is inevitable when robots face multiple operation scenarios. How to make the robot competent in force control for generalized operations such as multi-tasks still remains a challenging problem. Aiming at the reproducibility of interaction tasks and the lack of a generalized force control framework for multi-task scenarios, this paper proposes a novel hybrid control framework based on active admittance control with iterative learning parameters-tunning mechanism. The method adopts admittance control as the underlying algorithm to ensure flexibility, and iterative learning as the high-level algorithm to regulate the parameters of the admittance model. The whole algorithm has flexibility and learning ability, which is capable of achieving the goal of excellent versatility. Four representative interactive robot manipulation tasks are chosen to investigate the consistency and generalisability of the proposed method. Experiments are designed to verify the effectiveness of the whole framework, and an average of 98.21% and 91.52% improvement of RMSE is obtained relative to the traditional admittance control as well as the model-free adaptive control, respectively.
☆ Arm-Constrained Curriculum Learning for Loco-Manipulation of the Wheel-Legged Robot
Incorporating a robotic manipulator into a wheel-legged robot enhances its agility and expands its potential for practical applications. However, the presence of potential instability and uncertainties presents additional challenges for control objectives. In this paper, we introduce an arm-constrained curriculum learning architecture to tackle the issues introduced by adding the manipulator. Firstly, we develop an arm-constrained reinforcement learning algorithm to ensure safety and stability in control performance. Additionally, to address discrepancies in reward settings between the arm and the base, we propose a reward-aware curriculum learning method. The policy is first trained in Isaac gym and transferred to the physical robot to do dynamic grasping tasks, including the door-opening task, fan-twitching task and the relay-baton-picking and following task. The results demonstrate that our proposed approach effectively controls the arm-equipped wheel-legged robot to master dynamic grasping skills, allowing it to chase and catch a moving object while in motion. The code can be found at https://github.com/aCodeDog/legged-robots-manipulation. To view the supplemental video, please visit https://youtu.be/sNXT-rwPNMM.
☆ Hallucination Detection in Foundation Models for Decision-Making: A Flexible Definition and Review of the State of the Art
Autonomous systems are soon to be ubiquitous, from manufacturing autonomy to agricultural field robots, and from health care assistants to the entertainment industry. The majority of these systems are developed with modular sub-components for decision-making, planning, and control that may be hand-engineered or learning-based. While these existing approaches have been shown to perform well under the situations they were specifically designed for, they can perform especially poorly in rare, out-of-distribution scenarios that will undoubtedly arise at test-time. The rise of foundation models trained on multiple tasks with impressively large datasets from a variety of fields has led researchers to believe that these models may provide common sense reasoning that existing planners are missing. Researchers posit that this common sense reasoning will bridge the gap between algorithm development and deployment to out-of-distribution tasks, like how humans adapt to unexpected scenarios. Large language models have already penetrated the robotics and autonomous systems domains as researchers are scrambling to showcase their potential use cases in deployment. While this application direction is very promising empirically, foundation models are known to hallucinate and generate decisions that may sound reasonable, but are in fact poor. We argue there is a need to step back and simultaneously design systems that can quantify the certainty of a model's decision, and detect when it may be hallucinating. In this work, we discuss the current use cases of foundation models for decision-making tasks, provide a general definition for hallucinations with examples, discuss existing approaches to hallucination detection and mitigation with a focus on decision problems, and explore areas for further research in this exciting field.
comment: 31 pages, 2 tables
☆ Spatially temporally distributed informative path planning for multi-robot systems
This paper investigates the problem of informative path planning for a mobile robotic sensor network in spatially temporally distributed mapping. The robots are able to gather noisy measurements from an area of interest during their movements to build a Gaussian Process (GP) model of a spatio-temporal field. The model is then utilized to predict the spatio-temporal phenomenon at different points of interest. To spatially and temporally navigate the group of robots so that they can optimally acquire maximal information gains while their connectivity is preserved, we propose a novel multistep prediction informative path planning optimization strategy employing our newly defined local cost functions. By using the dual decomposition method, it is feasible and practical to effectively solve the optimization problem in a distributed manner. The proposed method was validated through synthetic experiments utilizing real-world data sets.
☆ Real-time Model Predictive Control with Zonotope-Based Neural Networks for Bipedal Social Navigation
This study addresses the challenge of bipedal navigation in a dynamic human-crowded environment, a research area that remains largely underexplored in the field of legged navigation. We propose two cascaded zonotope-based neural networks: a Pedestrian Prediction Network (PPN) for pedestrians' future trajectory prediction and an Ego-agent Social Network (ESN) for ego-agent social path planning. Representing future paths as zonotopes allows for efficient reachability-based planning and collision checking. The ESN is then integrated with a Model Predictive Controller (ESN-MPC) for footstep planning for our bipedal robot Digit designed by Agility Robotics. ESN-MPC solves for a collision-free optimal trajectory by optimizing through the gradients of ESN. ESN-MPC optimal trajectory is sent to the low-level controller for full-order simulation of Digit. The overall proposed framework is validated with extensive simulations on randomly generated initial settings with varying human crowd densities.
comment: 8 pages, 9 figures
☆ Towards Cooperative Maneuver Planning in Mixed Traffic at Urban Intersections
Connected automated driving promises a significant improvement of traffic efficiency and safety on highways and in urban areas. Apart from sharing of awareness and perception information over wireless communication links, cooperative maneuver planning may facilitate active guidance of connected automated vehicles at urban intersections. Research in automatic intersection management put forth a large body of works that mostly employ rule-based or optimization-based approaches primarily in fully automated simulated environments. In this work, we present two cooperative planning approaches that are capable of handling mixed traffic, i.e., the road being shared by automated vehicles and regular vehicles driven by humans. Firstly, we propose an optimization-based planner trained on real driving data that cyclically selects the most efficient out of multiple predicted coordinated maneuvers. Additionally, we present a cooperative planning approach based on graph-based reinforcement learning, which conquers the lack of ground truth data for cooperative maneuvers. We present evaluation results of both cooperative planners in high-fidelity simulation and real-world traffic. Simulative experiments in fully automated traffic and mixed traffic show that cooperative maneuver planning leads to less delay due to interaction and a reduced number of stops. In real-world experiments with three prototype connected automated vehicles in public traffic, both planners demonstrate their ability to perform efficient cooperative maneuvers.
comment: M. Klimke and M. Mertens are both first authors with equal contribution. 11 pages, 10 figures, 2 tables, submitted to IEEE Transactions on Intelligent Vehicles
☆ Producing and Leveraging Online Map Uncertainty in Trajectory Prediction CVPR 2024
High-definition (HD) maps have played an integral role in the development of modern autonomous vehicle (AV) stacks, albeit with high associated labeling and maintenance costs. As a result, many recent works have proposed methods for estimating HD maps online from sensor data, enabling AVs to operate outside of previously-mapped regions. However, current online map estimation approaches are developed in isolation of their downstream tasks, complicating their integration in AV stacks. In particular, they do not produce uncertainty or confidence estimates. In this work, we extend multiple state-of-the-art online map estimation methods to additionally estimate uncertainty and show how this enables more tightly integrating online mapping with trajectory forecasting. In doing so, we find that incorporating uncertainty yields up to 50% faster training convergence and up to 15% better prediction performance on the real-world nuScenes driving dataset.
comment: 14 pages, 14 figures, 6 tables. CVPR 2024
☆ AeroBridge: Autonomous Drone Handoff System for Emergency Battery Service
This paper proposes an Emergency Battery Service (EBS) for drones in which an EBS drone flies to a drone in the field with a depleted battery and transfers a fresh battery to the exhausted drone. The authors present a unique battery transfer mechanism and drone localization that uses the Cross Marker Position (CMP) method. The main challenges include a stable and balanced transfer that precisely localizes the receiver drone. The proposed EBS drone mitigates the effects of downwash due to the vertical proximity between the drones by implementing diagonal alignment with the receiver, reducing the distance to 0.5 m between the two drones. CFD analysis shows that diagonal instead of perpendicular alignment minimizes turbulence, and the authors verify the actual system for change in output airflow and thrust measurements. The CMP marker-based localization method enables position lock for the EBS drone with up to 0.9 cm accuracy. The performance of the transfer mechanism is validated experimentally by successful mid-air transfer in 5 seconds, where the EBS drone is within 0.5 m vertical distance from the receiver drone, wherein 4m/s turbulence does not affect the transfer process.
☆ Enhancing Visual Place Recognition via Fast and Slow Adaptive Biasing in Event Cameras
Event cameras are increasingly popular in robotics due to their beneficial features, such as low latency, energy efficiency, and high dynamic range. Nevertheless, their downstream task performance is greatly influenced by the optimization of bias parameters. These parameters, for instance, regulate the necessary change in light intensity to trigger an event, which in turn depends on factors such as the environment lighting and camera motion. This paper introduces feedback control algorithms that automatically tune the bias parameters through two interacting methods: 1) An immediate, on-the-fly fast adaptation of the refractory period, which sets the minimum interval between consecutive events, and 2) if the event rate exceeds the specified bounds even after changing the refractory period repeatedly, the controller adapts the pixel bandwidth and event thresholds, which stabilizes after a short period of noise events across all pixels (slow adaptation). Our evaluation focuses on the visual place recognition task, where incoming query images are compared to a given reference database. We conducted comprehensive evaluations of our algorithms' adaptive feedback control in real-time. To do so, we collected the QCR-Fast-and-Slow dataset that contains DAVIS346 event camera streams from 366 repeated traversals of a Scout Mini robot navigating through a 100 meter long indoor lab setting (totaling over 35km distance traveled) in varying brightness conditions with ground truth location information. Our proposed feedback controllers result in superior performance when compared to the standard bias settings and prior feedback control methods. Our findings also detail the impact of bias adjustments on task performance and feature ablation studies on the fast and slow adaptation mechanisms.
comment: 8 pages, 9 figures, paper under review
☆ Terrain-Attentive Learning for Efficient 6-DoF Kinodynamic Modeling on Vertically Challenging Terrain
Wheeled robots have recently demonstrated superior mechanical capability to traverse vertically challenging terrain (e.g., extremely rugged boulders comparable in size to the vehicles themselves). Negotiating such terrain introduces significant variations of vehicle pose in all six Degrees-of-Freedom (DoFs), leading to imbalanced contact forces, varying momentum, and chassis deformation due to non-rigid tires and suspensions. To autonomously navigate on vertically challenging terrain, all these factors need to be efficiently reasoned within limited onboard computation and strict real-time constraints. In this paper, we propose a 6-DoF kinodynamics learning approach that is attentive only to the specific underlying terrain critical to the current vehicle-terrain interaction, so that it can be efficiently queried in real-time motion planners onboard small robots. Physical experiment results show our Terrain-Attentive Learning demonstrates on average 51.1% reduction in model prediction error among all 6 DoFs compared to a state-of-the-art model for vertically challenging terrain.
☆ ASDF: Assembly State Detection Utilizing Late Fusion by Integrating 6D Pose Estimation
In medical and industrial domains, providing guidance for assembly processes is critical to ensure efficiency and safety. Errors in assembly can lead to significant consequences such as extended surgery times, and prolonged manufacturing or maintenance times in industry. Assembly scenarios can benefit from in-situ AR visualization to provide guidance, reduce assembly times and minimize errors. To enable in-situ visualization 6D pose estimation can be leveraged. Existing 6D pose estimation techniques primarily focus on individual objects and static captures. However, assembly scenarios have various dynamics including occlusion during assembly and dynamics in the assembly objects appearance. Existing work, combining object detection/6D pose estimation and assembly state detection focuses either on pure deep learning-based approaches, or limit the assembly state detection to building blocks. To address the challenges of 6D pose estimation in combination with assembly state detection, our approach ASDF builds upon the strengths of YOLOv8, a real-time capable object detection framework. We extend this framework, refine the object pose and fuse pose knowledge with network-detected pose information. Utilizing our late fusion in our Pose2State module results in refined 6D pose estimation and assembly state detection. By combining both pose and state information, our Pose2State module predicts the final assembly state with precision. Our evaluation on our ASDF dataset shows that our Pose2State module leads to an improved assembly state detection and that the improvement of the assembly state further leads to a more robust 6D pose estimation. Moreover, on the GBOT dataset, we outperform the pure deep learning-based network, and even outperform the hybrid and pure tracking-based approaches.
☆ SE(3) Linear Parameter Varying Dynamical Systems for Globally Asymptotically Stable End-Effector Control
Linear Parameter Varying Dynamical Systems (LPV-DS) encode trajectories into an autonomous first-order DS that enables reactive responses to perturbations, while ensuring globally asymptotic stability at the target. However, the current LPV-DS framework is established on Euclidean data only and has not been applicable to broader robotic applications requiring pose control. In this paper we present an extension to the current LPV-DS framework, named Quaternion-DS, which efficiently learns a DS-based motion policy for orientation. Leveraging techniques from differential geometry and Riemannian statistics, our approach properly handles the non-Euclidean orientation data in quaternion space, enabling the integration with positional control, namely SE(3) LPV-DS, so that the synergistic behaviour within the full SE(3) pose is preserved. Through simulation and real robot experiments, we validate our method, demonstrating its ability to efficiently and accurately reproduce the original SE(3) trajectory while exhibiting strong robustness to perturbations in task space.
☆ Bipedal Safe Navigation over Uncertain Rough Terrain: Unifying Terrain Mapping and Locomotion Stability
We study the problem of bipedal robot navigation in complex environments with uncertain and rough terrain. In particular, we consider a scenario in which the robot is expected to reach a desired goal location by traversing an environment with uncertain terrain elevation. Such terrain uncertainties induce not only untraversable regions but also robot motion perturbations. Thus, the problems of terrain mapping and locomotion stability are intertwined. We evaluate three different kernels for Gaussian process (GP) regression to learn the terrain elevation. We also learn the motion deviation resulting from both the terrain as well as the discrepancy between the reduced-order Prismatic Inverted Pendulum Model used for planning and the full-order locomotion dynamics. We propose a hierarchical locomotion-dynamics-aware sampling-based navigation planner. The global navigation planner plans a series of local waypoints to reach the desired goal locations while respecting locomotion stability constraints. Then, a local navigation planner is used to generate a sequence of dynamically feasible footsteps to reach local waypoints. We develop a novel trajectory evaluation metric to minimize motion deviation and maximize information gain of the terrain elevation map. We evaluate the efficacy of our planning framework on Digit bipedal robot simulation in MuJoCo.
comment: 9 pages, 7 figures
☆ Human Stress Response and Perceived Safety during Encounters with Quadruped Robots
Despite the rise of mobile robot deployments in home and work settings, perceived safety of users and bystanders is understudied in the human-robot interaction (HRI) literature. To address this, we present a study designed to identify elements of a human-robot encounter that correlate with observed stress response. Stress is a key component of perceived safety and is strongly associated with human physiological response. In this study a Boston Dynamics Spot and a Unitree Go1 navigate autonomously through a shared environment occupied by human participants wearing multimodal physiological sensors to track their electrocardiography (ECG) and electrodermal activity (EDA). The encounters are varied through several trials and participants self-rate their stress levels after each encounter. The study resulted in a multidimensional dataset archiving various objective and subjective aspects of a human-robot encounter, containing insights for understanding perceived safety in such encounters. To this end, acute stress responses were decoded from the human participants' ECG and EDA and compared across different human-robot encounter conditions. Statistical analysis of data indicate that on average (1) participants feel more stress during encounters compared to baselines, (2) participants feel more stress encountering multiple robots compared to a single robot and (3) participants stress increases during navigation behavior compared with search behavior.
comment: 7 pages, 7 figs, 5 tables
☆ Exploring CausalWorld: Enhancing robotic manipulation via knowledge transfer and curriculum learning
This study explores a learning-based tri-finger robotic arm manipulating task, which requires complex movements and coordination among the fingers. By employing reinforcement learning, we train an agent to acquire the necessary skills for proficient manipulation. To enhance the efficiency and effectiveness of the learning process, two knowledge transfer strategies, fine-tuning and curriculum learning, were utilized within the soft actor-critic architecture. Fine-tuning allows the agent to leverage pre-trained knowledge and adapt it to new tasks. Several variations like model transfer, policy transfer, and across-task transfer were implemented and evaluated. To eliminate the need for pretraining, curriculum learning decomposes the advanced task into simpler, progressive stages, mirroring how humans learn. The number of learning stages, the context of the sub-tasks, and the transition timing were found to be the critical design parameters. The key factors of two learning strategies and corresponding effects were explored in context-aware and context-unaware scenarios, enabling us to identify the scenarios where the methods demonstrate optimal performance, derive conclusive insights, and contribute to a broader range of learning-based engineering applications.
☆ Impact-Aware Bimanual Catching of Large-Momentum Objects
This paper investigates one of the most challenging tasks in dynamic manipulation -- catching large-momentum moving objects. Beyond the realm of quasi-static manipulation, dealing with highly dynamic objects can significantly improve the robot's capability of interacting with its surrounding environment. Yet, the inevitable motion mismatch between the fast moving object and the approaching robot will result in large impulsive forces, which lead to the unstable contacts and irreversible damage to both the object and the robot. To address the above problems, we propose an online optimization framework to: 1) estimate and predict the linear and angular motion of the object; 2) search and select the optimal contact locations across every surface of the object to mitigate impact through sequential quadratic programming (SQP); 3) simultaneously optimize the end-effector motion, stiffness, and contact force for both robots using multi-mode trajectory optimization (MMTO); and 4) realise the impact-aware catching motion on the compliant robotic system based on indirect force controller. We validate the impulse distribution, contact selection, and impact-aware MMTO algorithms in simulation and demonstrate the benefits of the proposed framework in real-world experiments including catching large-momentum moving objects with well-defined motion, constrained motion and free-flying motion.
☆ DASA: Delay-Adaptive Multi-Agent Stochastic Approximation
We consider a setting in which $N$ agents aim to speedup a common Stochastic Approximation (SA) problem by acting in parallel and communicating with a central server. We assume that the up-link transmissions to the server are subject to asynchronous and potentially unbounded time-varying delays. To mitigate the effect of delays and stragglers while reaping the benefits of distributed computation, we propose \texttt{DASA}, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation. We provide a finite-time analysis of \texttt{DASA} assuming that the agents' stochastic observation processes are independent Markov chains. Significantly advancing existing results, \texttt{DASA} is the first algorithm whose convergence rate depends only on the mixing time $\tmix$ and on the average delay $\tau_{avg}$ while jointly achieving an $N$-fold convergence speedup under Markovian sampling. Our work is relevant for various SA applications, including multi-agent and distributed temporal difference (TD) learning, Q-learning and stochastic optimization with correlated data.
☆ TwoStep: Multi-agent Task Planning using Classical Planners and Large Language Models
Classical planning formulations like the Planning Domain Definition Language (PDDL) admit action sequences guaranteed to achieve a goal state given an initial state if any are possible. However, reasoning problems defined in PDDL do not capture temporal aspects of action taking, for example that two agents in the domain can execute an action simultaneously if postconditions of each do not interfere with preconditions of the other. A human expert can decompose a goal into largely independent constituent parts and assign each agent to one of these subgoals to take advantage of simultaneous actions for faster execution of plan steps, each using only single agent planning. By contrast, large language models (LLMs) used for directly inferring plan steps do not guarantee execution success, but do leverage commonsense reasoning to assemble action sequences. We combine the strengths of classical planning and LLMs by approximating human intuitions for two-agent planning goal decomposition. We demonstrate that LLM-based goal decomposition leads to faster planning times than solving multi-agent PDDL problems directly while simultaneously achieving fewer plan execution steps than a single agent plan alone and preserving execution success. Additionally, we find that LLM-based approximations of subgoals can achieve similar multi-agent execution steps than those specified by human experts. Website and resources at https://glamor-usc.github.io/twostep
comment: 12 pages
☆ Temporal and Semantic Evaluation Metrics for Foundation Models in Post-Hoc Analysis of Robotic Sub-tasks IROS 2024
Recent works in Task and Motion Planning (TAMP) show that training control policies on language-supervised robot trajectories with quality labeled data markedly improves agent task success rates. However, the scarcity of such data presents a significant hurdle to extending these methods to general use cases. To address this concern, we present an automated framework to decompose trajectory data into temporally bounded and natural language-based descriptive sub-tasks by leveraging recent prompting strategies for Foundation Models (FMs) including both Large Language Models (LLMs) and Vision Language Models (VLMs). Our framework provides both time-based and language-based descriptions for lower-level sub-tasks that comprise full trajectories. To rigorously evaluate the quality of our automatic labeling framework, we contribute an algorithm SIMILARITY to produce two novel metrics, temporal similarity and semantic similarity. The metrics measure the temporal alignment and semantic fidelity of language descriptions between two sub-task decompositions, namely an FM sub-task decomposition prediction and a ground-truth sub-task decomposition. We present scores for temporal similarity and semantic similarity above 90%, compared to 30% of a randomized baseline, for multiple robotic environments, demonstrating the effectiveness of our proposed framework. Our results enable building diverse, large-scale, language-supervised datasets for improved robotic TAMP.
comment: 8 pages, 3 figures. IROS 2024 Submission
☆ Speeding Up Path Planning via Reinforcement Learning in MCTS for Automated Parking
In this paper, we address a method that integrates reinforcement learning into the Monte Carlo tree search to boost online path planning under fully observable environments for automated parking tasks. Sampling-based planning methods under high-dimensional space can be computationally expensive and time-consuming. State evaluation methods are useful by leveraging the prior knowledge into the search steps, making the process faster in a real-time system. Given the fact that automated parking tasks are often executed under complex environments, a solid but lightweight heuristic guidance is challenging to compose in a traditional analytical way. To overcome this limitation, we propose a reinforcement learning pipeline with a Monte Carlo tree search under the path planning framework. By iteratively learning the value of a state and the best action among samples from its previous cycle's outcomes, we are able to model a value estimator and a policy generator for given states. By doing that, we build up a balancing mechanism between exploration and exploitation, speeding up the path planning process while maintaining its quality without using human expert driver data.
☆ PROSPECT: Precision Robot Spectroscopy Exploration and Characterization Tool
Near Infrared (NIR) spectroscopy is widely used in industrial quality control and automation to test the purity and material quality of items. In this research, we propose a novel sensorized end effector and acquisition strategy to capture spectral signatures from objects and register them with a 3D point cloud. Our methodology first takes a 3D scan of an object generated by a time-of-flight depth camera and decomposes the object into a series of planned viewpoints covering the surface. We generate motion plans for a robot manipulator and end-effector to visit these viewpoints while maintaining a fixed distance and surface normal to ensure maximal spectral signal quality enabled by the spherical motion of the end-effector. By continuously acquiring surface reflectance values as the end-effector scans the target object, the autonomous system develops a four-dimensional model of the target object: position in an R^3 coordinate frame, and a wavelength vector denoting the associated spectral signature. We demonstrate this system in building spectral-spatial object profiles of increasingly complex geometries. As a point of comparison, we show our proposed system and spectral acquisition planning yields more consistent signal signals than naive point scanning strategies for capturing spectral information over complex surface geometries. Our work represents a significant step towards high-resolution spectral-spatial sensor fusion for automated quality assessment.
☆ Dyna-LfLH: Learning Agile Navigation in Dynamic Environments from Learned Hallucination IROS
This paper presents a self-supervised learning method to safely learn a motion planner for ground robots to navigate environments with dense and dynamic obstacles. When facing highly-cluttered, fast-moving, hard-to-predict obstacles, classical motion planners may not be able to keep up with limited onboard computation. For learning-based planners, high-quality demonstrations are difficult to acquire for imitation learning while reinforcement learning becomes inefficient due to the high probability of collision during exploration. To safely and efficiently provide training data, the Learning from Hallucination (LfH) approaches synthesize difficult navigation environments based on past successful navigation experiences in relatively easy or completely open ones, but unfortunately cannot address dynamic obstacles. In our new Dynamic Learning from Learned Hallucination (Dyna-LfLH), we design and learn a novel latent distribution and sample dynamic obstacles from it, so the generated training data can be used to learn a motion planner to navigate in dynamic environments. Dyna-LfLH is evaluated on a ground robot in both simulated and physical environments and achieves up to 25% better success rate compared to baselines.
comment: Submitted to International Conference on Intelligent Robots and Systems (IROS) 2024
☆ Multi-Contact Inertial Estimation and Localization in Legged Robots
Optimal estimation is a promising tool for multi-contact inertial estimation and localization. To harness its advantages in robotics, it is crucial to solve these large and challenging optimization problems efficiently. To tackle this, we (i) develop a multiple-shooting solver that exploits both temporal and parametric structures through a parametrized Riccati recursion. Additionally, we (ii) propose an inertial local manifold that ensures its full physical consistency. It also enhances convergence compared to the singularity-free log-Cholesky approach. To handle its singularities, we (iii) introduce a nullspace approach in our optimal estimation solver. We (iv) finally develop the analytical derivatives of contact dynamics for both inertial parametrizations. Our framework can successfully solve estimation problems for complex maneuvers such as brachiation in humanoids. We demonstrate its numerical capabilities across various robotics tasks and its benefits in experimental trials with the Go1 robot.
☆ Hearing the shape of an arena with spectral swarm robotics
Swarm robotics promises adaptability to unknown situations and robustness against failures. However, it still struggles with global tasks that require understanding the broader context in which the robots operate, such as identifying the shape of the arena in which the robots are embedded. Biological swarms, such as shoals of fish, flocks of birds, and colonies of insects, routinely solve global geometrical problems through the diffusion of local cues. This paradigm can be explicitly described by mathematical models that could be directly computed and exploited by a robotic swarm. Diffusion over a domain is mathematically encapsulated by the Laplacian, a linear operator that measures the local curvature of a function. Crucially the geometry of a domain can generally be reconstructed from the eigenspectrum of its Laplacian. Here we introduce spectral swarm robotics where robots diffuse information to their neighbors to emulate the Laplacian operator - enabling them to "hear" the spectrum of their arena. We reveal a universal scaling that links the optimal number of robots (a global parameter) with their optimal radius of interaction (a local parameter). We validate experimentally spectral swarm robotics under challenging conditions with the one-shot classification of arena shapes using a sparse swarm of Kilobots. Spectral methods can assist with challenging tasks where robots need to build an emergent consensus on their environment, such as adaptation to unknown terrains, division of labor, or quorum sensing. Spectral methods may extend beyond robotics to analyze and coordinate swarms of agents of various natures, such as traffic or crowds, and to better understand the long-range dynamics of natural systems emerging from short-range interactions.
☆ Adaptive Step Duration for Precise Foot Placement: Achieving Robust Bipedal Locomotion on Terrains with Restricted Footholds
This paper introduces a novel multi-step preview foot placement planning algorithm designed to enhance the robustness of bipedal robotic walking across challenging terrains with restricted footholds. Traditional one-step preview planning struggles to maintain stability when stepping areas are severely limited, such as with random stepping stones. In this work, we developed a discrete-time Model Predictive Control (MPC) based on the step-to-step discrete evolution of the Divergent Component of Motion (DCM) of bipedal locomotion. This approach adaptively changes the step duration for optimal foot placement under constraints, thereby ensuring the robot's operational viability over multiple future steps and significantly improving its ability to navigate through environments with tight constraints on possible footholds. The effectiveness of this planning algorithm is demonstrated through simulations that include a variety of complex stepping-stone configurations and external perturbations. These tests underscore the algorithm's improved performance for navigating foothold-restricted environments, even with the presence of external disturbances.
comment: 8 pages, 8 figures, submitted to CDC 2024, for associated simulation video, see https://youtu.be/2jhikPlZmbE
☆ Grounding Language Plans in Demonstrations Through Counterfactual Perturbations
Grounding the common-sense reasoning of Large Language Models in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior works have focused on leveraging LLMs directly for planning in symbolic spaces, this work uses LLMs to guide the search of task structures and constraints implicit in multi-step demonstrations. Specifically, we borrow from manipulation planning literature the concept of mode families, which group robot configurations by specific motion constraints, to serve as an abstraction layer between the high-level language representations of an LLM and the low-level physical trajectories of a robot. By replaying a few human demonstrations with synthetic perturbations, we generate coverage over the demonstrations' state space with additional successful executions as well as counterfactuals that fail the task. Our explanation-based learning framework trains an end-to-end differentiable neural network to predict successful trajectories from failures and as a by-product learns classifiers that ground low-level states and images in mode families without dense labeling. The learned grounding classifiers can further be used to translate language plans into reactive policies in the physical domain in an interpretable manner. We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks. Website: https://sites.google.com/view/grounding-plans
☆ Vision-Based Dexterous Motion Planning by Dynamic Movement Primitives with Human Hand Demonstration
This paper proposes a vision-based framework for a 7-degree-of-freedom robotic manipulator, with the primary objective of facilitating its capacity to acquire information from human hand demonstrations for the execution of dexterous pick-and-place tasks. Most existing works only focus on the position demonstration without considering the orientations. In this paper, by employing a single depth camera, MediaPipe is applied to generate the three-dimensional coordinates of a human hand, thereby comprehensively recording the hand's motion, encompassing the trajectory of the wrist, orientation of the hand, and the grasp motion. A mean filter is applied during data pre-processing to smooth the raw data. The demonstration is designed to pick up an object at a specific angle, navigate around obstacles in its path and subsequently, deposit it within a sloped container. The robotic system demonstrates its learning capabilities, facilitated by the implementation of Dynamic Movement Primitives, enabling the assimilation of user actions into its trajectories with different start and end poi
☆ Berry Twist: a Twisting-Tube Soft Robotic Gripper for Blackberry Harvesting
As global demand for fruits and vegetables continues to rise, the agricultural industry faces challenges in securing adequate labor. Robotic harvesting devices offer a promising solution to solve this issue. However, harvesting delicate fruits, notably blackberries, poses unique challenges due to their fragility. This study introduces and evaluates a prototype robotic gripper specifically designed for blackberry harvesting. The gripper features an innovative fabric tube mechanism employing motorized twisting action to gently envelop the fruit, ensuring uniform pressure application and minimizing damage. Three types of tubes were developed, varying in elasticity and compressibility using foam padding, spandex, and food-safe cotton cheesecloth. Performance testing focused on assessing each gripper's ability to detach and release blackberries, with emphasis on quantifying damage rates. Results indicate the proposed gripper achieved an 82% success rate in detaching blackberries and a 95% success rate in releasing them, showcasing the promised potential for robotic harvesting applications.
☆ A Comparative Analysis of Visual Odometry in Virtual and Real-World Railways Environments
Perception tasks play a crucial role in the development of automated operations and systems across multiple application fields. In the railway transportation domain, these tasks can improve the safety, reliability, and efficiency of various perations, including train localization, signal recognition, and track discrimination. However, collecting considerable and precisely labeled datasets for testing such novel algorithms poses extreme challenges in the railway environment due to the severe restrictions in accessing the infrastructures and the practical difficulties associated with properly equipping trains with the required sensors, such as cameras and LiDARs. The remarkable innovations of graphic engine tools offer new solutions to craft realistic synthetic datasets. To illustrate the advantages of employing graphic simulation for early-stage testing of perception tasks in the railway domain, this paper presents a comparative analysis of the performance of a SLAM algorithm applied both in a virtual synthetic environment and a real-world scenario. The analysis leverages virtual railway environments created with the latest version of Unreal Engine, facilitating data collection and allowing the examination of challenging scenarios, including low-visibility, dangerous operational modes, and complex environments. The results highlight the feasibility and potentiality of graphic simulation to advance perception tasks in the railway domain.
☆ Trajectory Optimization with Global Yaw Parameterization for Field-of-View Constrained Autonomous Flight
Trajectory generation for quadrotors with limited field-of-view sensors has numerous applications such as aerial exploration, coverage, inspection, videography, and target tracking. Most previous works simplify the task of optimizing yaw trajectories by either aligning the heading of the robot with its velocity, or potentially restricting the feasible space of candidate trajectories by using a limited yaw domain to circumvent angular singularities. In this paper, we propose a novel \textit{global} yaw parameterization method for trajectory optimization that allows a 360-degree yaw variation as demanded by the underlying algorithm. This approach effectively bypasses inherent singularities by including supplementary quadratic constraints and transforming the final decision variables into the desired state representation. This method significantly reduces the needed control effort, and improves optimization feasibility. Furthermore, we apply the method to several examples of different applications that require jointly optimizing over both the yaw and position trajectories. Ultimately, we present a comprehensive numerical analysis and evaluation of our proposed method in both simulation and real-world experiments.
☆ Calib3D: Calibrating Model Preferences for Reliable 3D Scene Understanding
Safety-critical 3D scene understanding tasks necessitate not only accurate but also confident predictions from 3D perception models. This study introduces Calib3D, a pioneering effort to benchmark and scrutinize the reliability of 3D scene understanding models from an uncertainty estimation viewpoint. We comprehensively evaluate 28 state-of-the-art models across 10 diverse 3D datasets, uncovering insightful phenomena that cope with both the aleatoric and epistemic uncertainties in 3D scene understanding. We discover that despite achieving impressive levels of accuracy, existing models frequently fail to provide reliable uncertainty estimates -- a pitfall that critically undermines their applicability in safety-sensitive contexts. Through extensive analysis of key factors such as network capacity, LiDAR representations, rasterization resolutions, and 3D data augmentation techniques, we correlate these aspects directly with the model calibration efficacy. Furthermore, we introduce DeptS, a novel depth-aware scaling approach aimed at enhancing 3D model calibration. Extensive experiments across a wide range of configurations validate the superiority of our method. We hope this work could serve as a cornerstone for fostering reliable 3D scene understanding. Code and benchmark toolkits are publicly available.
comment: Preprint; 37 pages, 8 figures, 11 tables; Code at https://github.com/ldkong1205/Calib3D
☆ Optimizing LiDAR Placements for Robust Driving Perception in Adverse Conditions
The robustness of driving perception systems under unprecedented conditions is crucial for safety-critical usages. Latest advancements have prompted increasing interests towards multi-LiDAR perception. However, prevailing driving datasets predominantly utilize single-LiDAR systems and collect data devoid of adverse conditions, failing to capture the complexities of real-world environments accurately. Addressing these gaps, we proposed Place3D, a full-cycle pipeline that encompasses LiDAR placement optimization, data generation, and downstream evaluations. Our framework makes three appealing contributions. 1) To identify the most effective configurations for multi-LiDAR systems, we introduce a Surrogate Metric of the Semantic Occupancy Grids (M-SOG) to evaluate LiDAR placement quality. 2) Leveraging the M-SOG metric, we propose a novel optimization strategy to refine multi-LiDAR placements. 3) Centered around the theme of multi-condition multi-LiDAR perception, we collect a 364,000-frame dataset from both clean and adverse conditions. Extensive experiments demonstrate that LiDAR placements optimized using our approach outperform various baselines. We showcase exceptional robustness in both 3D object detection and LiDAR semantic segmentation tasks, under diverse adverse weather and sensor failure conditions. Code and benchmark toolkit are publicly available.
comment: Preprint; 40 pages, 11 figures, 15 tables; Code at https://github.com/ywyeli/Place3D
☆ DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving
End-to-end driving has made significant progress in recent years, demonstrating benefits such as system simplicity and competitive driving performance under both open-loop and closed-loop settings. Nevertheless, the lack of interpretability and controllability in its driving decisions hinders real-world deployment for end-to-end driving systems. In this paper, we collect a comprehensive end-to-end driving dataset named DriveCoT, leveraging the CARLA simulator. It contains sensor data, control decisions, and chain-of-thought labels to indicate the reasoning process. We utilize the challenging driving scenarios from the CARLA leaderboard 2.0, which involve high-speed driving and lane-changing, and propose a rule-based expert policy to control the vehicle and generate ground truth labels for its reasoning process across different driving aspects and the final decisions. This dataset can serve as an open-loop end-to-end driving benchmark, enabling the evaluation of accuracy in various chain-of-thought aspects and the final decision. In addition, we propose a baseline model called DriveCoT-Agent, trained on our dataset, to generate chain-of-thought predictions and final decisions. The trained model exhibits strong performance in both open-loop and closed-loop evaluations, demonstrating the effectiveness of our proposed dataset.
☆ Bayesian Methods for Trust in Collaborative Multi-Agent Autonomy
Multi-agent, collaborative sensor fusion is a vital component of a multi-national intelligence toolkit. In safety-critical and/or contested environments, adversaries may infiltrate and compromise a number of agents. We analyze state of the art multi-target tracking algorithms under this compromised agent threat model. We prove that the track existence probability test ("track score") is significantly vulnerable to even small numbers of adversaries. To add security awareness, we design a trust estimation framework using hierarchical Bayesian updating. Our framework builds beliefs of trust on tracks and agents by mapping sensor measurements to trust pseudomeasurements (PSMs) and incorporating prior trust beliefs in a Bayesian context. In case studies, our trust estimation algorithm accurately estimates the trustworthiness of tracks/agents, subject to observability limitations.
☆ Learning Symbolic and Subsymbolic Temporal Task Constraints from Bimanual Human Demonstrations IROS 2024
Learning task models of bimanual manipulation from human demonstration and their execution on a robot should take temporal constraints between actions into account. This includes constraints on (i) the symbolic level such as precedence relations or temporal overlap in the execution, and (ii) the subsymbolic level such as the duration of different actions, or their starting and end points in time. Such temporal constraints are crucial for temporal planning, reasoning, and the exact timing for the execution of bimanual actions on a bimanual robot. In our previous work, we addressed the learning of temporal task constraints on the symbolic level and demonstrated how a robot can leverage this knowledge to respond to failures during execution. In this work, we propose a novel model-driven approach for the combined learning of symbolic and subsymbolic temporal task constraints from multiple bimanual human demonstrations. Our main contributions are a subsymbolic foundation of a temporal task model that describes temporal nexuses of actions in the task based on distributions of temporal differences between semantic action keypoints, as well as a method based on fuzzy logic to derive symbolic temporal task constraints from this representation. This complements our previous work on learning comprehensive temporal task models by integrating symbolic and subsymbolic information based on a subsymbolic foundation, while still maintaining the symbolic expressiveness of our previous approach. We compare our proposed approach with our previous pure-symbolic approach and show that we can reproduce and even outperform it. Additionally, we show how the subsymbolic temporal task constraints can synchronize otherwise unimanual movement primitives for bimanual behavior on a humanoid robot.
comment: 8 pages, submitted to IROS 2024
☆ DHP-Mapping: A Dense Panoptic Mapping System with Hierarchical World Representation and Label Optimization Techniques IROS 2024
Maps provide robots with crucial environmental knowledge, thereby enabling them to perform interactive tasks effectively. Easily accessing accurate abstract-to-detailed geometric and semantic concepts from maps is crucial for robots to make informed and efficient decisions. To comprehensively model the environment and effectively manage the map data structure, we propose DHP-Mapping, a dense mapping system that utilizes multiple Truncated Signed Distance Field (TSDF) submaps and panoptic labels to hierarchically model the environment. The output map is able to maintain both voxel- and submap-level metric and semantic information. Two modules are presented to enhance the mapping efficiency and label consistency: (1) an inter-submaps label fusion strategy to eliminate duplicate points across submaps and (2) a conditional random field (CRF) based approach to enhance panoptic labels through object label comprehension and contextual information. We conducted experiments with two public datasets including indoor and outdoor scenarios. Our system performs comparably to state-of-the-art (SOTA) methods across geometry and label accuracy evaluation metrics. The experiment results highlight the effectiveness and scalability of our system, as it is capable of constructing precise geometry and maintaining consistent panoptic labels. Our code is publicly available at https://github.com/hutslib/DHP-Mapping.
comment: Submit to IROS 2024. Project website https://github.com/hutslib/DHP-Mapping
☆ Proprioception Is All You Need: Terrain Classification for Boreal Forests IROS 2024
Recent works in field robotics highlighted the importance of resiliency against different types of terrains. Boreal forests, in particular, are home to many mobility-impeding terrains that should be considered for off-road autonomous navigation. Also, being one of the largest land biomes on Earth, boreal forests are an area where autonomous vehicles are expected to become increasingly common. In this paper, we address this issue by introducing BorealTC, a publicly available dataset for proprioceptive-based terrain classification (TC). Recorded with a Husky A200, our dataset contains 116 min of Inertial Measurement Unit (IMU), motor current, and wheel odometry data, focusing on typical boreal forest terrains, notably snow, ice, and silty loam. Combining our dataset with another dataset from the state-of-the-art, we evaluate both a Convolutional Neural Network (CNN) and the novel state space model (SSM)-based Mamba architecture on a TC task. Interestingly, we show that while CNN outperforms Mamba on each separate dataset, Mamba achieves greater accuracy when trained on a combination of both. In addition, we demonstrate that Mamba's learning capacity is greater than a CNN for increasing amounts of data. We show that the combination of two TC datasets yields a latent space that can be interpreted with the properties of the terrains. We also discuss the implications of merging datasets on classification. Our source code and dataset are publicly available online: https://github.com/norlab-ulaval/BorealTC.
comment: Submitted to the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
☆ TAIL: A Terrain-Aware Multi-Modal SLAM Dataset for Robot Locomotion in Deformable Granular Environments
Terrain-aware perception holds the potential to improve the robustness and accuracy of autonomous robot navigation in the wilds, thereby facilitating effective off-road traversals. However, the lack of multi-modal perception across various motion patterns hinders the solutions of Simultaneous Localization And Mapping (SLAM), especially when confronting non-geometric hazards in demanding landscapes. In this paper, we first propose a Terrain-Aware multI-modaL (TAIL) dataset tailored to deformable and sandy terrains. It incorporates various types of robotic proprioception and distinct ground interactions for the unique challenges and benchmark of multi-sensor fusion SLAM. The versatile sensor suite comprises stereo frame cameras, multiple ground-pointing RGB-D cameras, a rotating 3D LiDAR, an IMU, and an RTK device. This ensemble is hardware-synchronized, well-calibrated, and self-contained. Utilizing both wheeled and quadrupedal locomotion, we efficiently collect comprehensive sequences to capture rich unstructured scenarios. It spans the spectrum of scope, terrain interactions, scene changes, ground-level properties, and dynamic robot characteristics. We benchmark several state-of-the-art SLAM methods against ground truth and provide performance validations. Corresponding challenges and limitations are also reported. All associated resources are accessible upon request at \url{https://tailrobot.github.io/}.
comment: Submitted to IEEE Robotics and Automation Letters
☆ A Semi-Lagrangian Approach for Time and Energy Path Planning Optimization in Static Flow Fields
Efficient path planning for autonomous mobile robots is a critical problem across numerous domains, where optimizing both time and energy consumption is paramount. This paper introduces a novel methodology that considers the dynamic influence of an environmental flow field and considers geometric constraints, including obstacles and forbidden zones, enriching the complexity of the planning problem. We formulate it as a multi-objective optimal control problem, propose a novel transformation called Harmonic Transformation, and apply a semi-Lagrangian scheme to solve it. The set of Pareto efficient solutions is obtained considering two distinct approaches: a deterministic method and an evolutionary-based one, both of which are designed to make use of the proposed Harmonic Transformation. Through an extensive analysis of these approaches, we demonstrate their efficacy in finding optimized paths.
comment: 12 pages, initial paper submission; Preprint submitted to the IEEE Transactions on Intelligent Transportation Systems
☆ Exploiting Priors from 3D Diffusion Models for RGB-Based One-Shot View Planning IROS 2024
Object reconstruction is relevant for many autonomous robotic tasks that require interaction with the environment. A key challenge in such scenarios is planning view configurations to collect informative measurements for reconstructing an initially unknown object. One-shot view planning enables efficient data collection by predicting view configurations and planning the globally shortest path connecting all views at once. However, geometric priors about the object are required to conduct one-shot view planning. In this work, we propose a novel one-shot view planning approach that utilizes the powerful 3D generation capabilities of diffusion models as priors. By incorporating such geometric priors into our pipeline, we achieve effective one-shot view planning starting with only a single RGB image of the object to be reconstructed. Our planning experiments in simulation and real-world setups indicate that our approach balances well between object reconstruction quality and movement cost.
comment: Sicong Pan and Liren Jin have equal contribution. Submitted to IROS 2024
☆ CurbNet: Curb Detection Framework Based on LiDAR Point Cloud Segmentation
Curb detection is an important function in intelligent driving and can be used to determine drivable areas of the road. However, curbs are difficult to detect due to the complex road environment. This paper introduces CurbNet, a novel framework for curb detection, leveraging point cloud segmentation. Addressing the dearth of comprehensive curb datasets and the absence of 3D annotations, we have developed the 3D-Curb dataset, encompassing 7,100 frames, which represents the largest and most categorically diverse collection of curb point clouds currently available. Recognizing that curbs are primarily characterized by height variations, our approach harnesses spatially-rich 3D point clouds for training. To tackle the challenges presented by the uneven distribution of curb features on the xy-plane and their reliance on z-axis high-frequency features, we introduce the multi-scale and channel attention (MSCA) module, a bespoke solution designed to optimize detection performance. Moreover, we propose an adaptive weighted loss function group, specifically formulated to counteract the imbalance in the distribution of curb point clouds relative to other categories. Our extensive experimentation on 2 major datasets has yielded results that surpass existing benchmarks set by leading curb detection and point cloud segmentation models. By integrating multi-clustering and curve fitting techniques in our post-processing stage, we have substantially reduced noise in curb detection, thereby enhancing precision to 0.8744. Notably, CurbNet has achieved an exceptional average metrics of over 0.95 at a tolerance of just 0.15m, thereby establishing a new benchmark. Furthermore, corroborative real-world experiments and dataset analyzes mutually validate each other, solidifying CurbNet's superior detection proficiency and its robust generalizability.
☆ DBPF: A Framework for Efficient and Robust Dynamic Bin-Picking
Efficiency and reliability are critical in robotic bin-picking as they directly impact the productivity of automated industrial processes. However, traditional approaches, demanding static objects and fixed collisions, lead to deployment limitations, operational inefficiencies, and process unreliability. This paper introduces a Dynamic Bin-Picking Framework (DBPF) that challenges traditional static assumptions. The DBPF endows the robot with the reactivity to pick multiple moving arbitrary objects while avoiding dynamic obstacles, such as the moving bin. Combined with scene-level pose generation, the proposed pose selection metric leverages the Tendency-Aware Manipulability Network optimizing suction pose determination. Heuristic task-specific designs like velocity-matching, dynamic obstacle avoidance, and the resight policy, enhance the picking success rate and reliability. Empirical experiments demonstrate the importance of these components. Our method achieves an average 84% success rate, surpassing the 60% of the most comparable baseline, crucially, with zero collisions. Further evaluations under diverse dynamic scenarios showcase DBPF's robust performance in dynamic bin-picking. Results suggest that our framework offers a promising solution for efficient and reliable robotic bin-picking under dynamics.
comment: 8 pages, 5 figures. This paper has been accepted by IEEE RA-L on 2024-03-24. See the supplementary video at youtube: https://youtu.be/n5af2VsKhkg
☆ Visual Action Planning with Multiple Heterogeneous Agents
Visual planning methods are promising to handle complex settings where extracting the system state is challenging. However, none of the existing works tackles the case of multiple heterogeneous agents which are characterized by different capabilities and/or embodiment. In this work, we propose a method to realize visual action planning in multi-agent settings by exploiting a roadmap built in a low-dimensional structured latent space and used for planning. To enable multi-agent settings, we infer possible parallel actions from a dataset composed of tuples associated with individual actions. Next, we evaluate feasibility and cost of them based on the capabilities of the multi-agent system and endow the roadmap with this information, building a capability latent space roadmap (C-LSR). Additionally, a capability suggestion strategy is designed to inform the human operator about possible missing capabilities when no paths are found. The approach is validated in a simulated burger cooking task and a real-world box packing task.
☆ Low-Cost Teleoperation with Haptic Feedback through Vision-based Tactile Sensors for Rigid and Soft Object Manipulation
Haptic feedback is essential for humans to successfully perform complex and delicate manipulation tasks. A recent rise in tactile sensors has enabled robots to leverage the sense of touch and expand their capability drastically. However, many tasks still need human intervention/guidance. For this reason, we present a teleoperation framework designed to provide haptic feedback to human operators based on the data from camera-based tactile sensors mounted on the robot gripper. Partial autonomy is introduced to prevent slippage of grasped objects during task execution. Notably, we rely exclusively on low-cost off-the-shelf hardware to realize an affordable solution. We demonstrate the versatility of the framework on nine different objects ranging from rigid to soft and fragile ones, using three different operators on real hardware.
comment: https://vision-tactile-manip.github.io/teleop/
☆ ProIn: Learning to Predict Trajectory Based on Progressive Interactions for Autonomous Driving
Accurate motion prediction of pedestrians, cyclists, and other surrounding vehicles (all called agents) is very important for autonomous driving. Most existing works capture map information through an one-stage interaction with map by vector-based attention, to provide map constraints for social interaction and multi-modal differentiation. However, these methods have to encode all required map rules into the focal agent's feature, so as to retain all possible intentions' paths while at the meantime to adapt to potential social interaction. In this work, a progressive interaction network is proposed to enable the agent's feature to progressively focus on relevant maps, in order to better learn agents' feature representation capturing the relevant map constraints. The network progressively encode the complex influence of map constraints into the agent's feature through graph convolutions at the following three stages: after historical trajectory encoder, after social interaction, and after multi-modal differentiation. In addition, a weight allocation mechanism is proposed for multi-modal training, so that each mode can obtain learning opportunities from a single-mode ground truth. Experiments have validated the superiority of progressive interactions to the existing one-stage interaction, and demonstrate the effectiveness of each component. Encouraging results were obtained in the challenging benchmarks.
♻ ☆ A Modular Pneumatic Soft Gripper Design for Aerial Grasping and Landing
Aerial robots have garnered significant attention due to their potential applications in various industries, such as inspection, search and rescue, and drone delivery. Successful missions often depend on the ability of these robots to grasp and land effectively. This paper presents a novel modular soft gripper design tailored explicitly for aerial grasping and landing operations. The proposed modular pneumatic soft gripper incorporates a feed-forward proportional controller to regulate pressure, enabling compliant gripping capabilities. The modular connectors of the soft fingers offer two configurations for the 4-tip soft gripper, H-base (cylindrical) and X-base (spherical), allowing adaptability to different target objects. Additionally, the gripper can serve as a soft landing gear when deflated, eliminating the need for an extra landing gear. This design reduces weight, simplifies aerial manipulation control, and enhances flight efficiency. We demonstrate the efficacy of indoor aerial grasping and achieve a maximum payload of 217 g using the proposed soft aerial vehicle and its H-base pneumatic soft gripper (808 g).
comment: 7 pages, 13 figures, accepted by IEEE RoboSoft 2024
♻ ☆ Robust Integral Consensus Control of Multi-Agent Networks Perturbed by Matched and Unmatched Disturbances: The Case of Directed Graphs
This work presents a new method to design consensus controllers for perturbed double integrator systems whose interconnection is described by a directed graph containing a rooted spanning tree. We propose new robust controllers to solve the consensus and synchronization problems when the systems are under the effects of matched and unmatched disturbances. In both problems, we present simple continuous controllers, whose integral actions allow us to handle the disturbances. A rigorous stability analysis based on Lyapunov's direct method for unperturbed networked systems is presented. To assess the performance of our result, a representative simulation study is presented.
♻ ☆ A Group Theoretic Metric for Robot State Estimation Leveraging Chebyshev Interpolation ICRA 2024
We propose a new metric for robot state estimation based on the recently introduced $\text{SE}_2(3)$ Lie group definition. Our metric is related to prior metrics for SLAM but explicitly takes into account the linear velocity of the state estimate, improving over current pose-based trajectory analysis. This has the benefit of providing a single, quantitative metric to evaluate state estimation algorithms against, while being compatible with existing tools and libraries. Since ground truth data generally consists of pose data from motion capture systems, we also propose an approach to compute the ground truth linear velocity based on polynomial interpolation. Using Chebyshev interpolation and a pseudospectral parameterization, we can accurately estimate the ground truth linear velocity of the trajectory in an optimal fashion with best approximation error. We demonstrate how this approach performs on multiple robotic platforms where accurate state estimation is vital, and compare it to alternative approaches such as finite differences. The pseudospectral parameterization also provides a means of trajectory data compression as an additional benefit. Experimental results show our method provides a valid and accurate means of comparing state estimation systems, which is also easy to interpret and report.
comment: Accepted to ICRA 2024
♻ ☆ I-PHYRE: Interactive Physical Reasoning ICLR 2024
Current evaluation protocols predominantly assess physical reasoning in stationary scenes, creating a gap in evaluating agents' abilities to interact with dynamic events. While contemporary methods allow agents to modify initial scene configurations and observe consequences, they lack the capability to interact with events in real time. To address this, we introduce I-PHYRE, a framework that challenges agents to simultaneously exhibit intuitive physical reasoning, multi-step planning, and in-situ intervention. Here, intuitive physical reasoning refers to a quick, approximate understanding of physics to address complex problems; multi-step denotes the need for extensive sequence planning in I-PHYRE, considering each intervention can significantly alter subsequent choices; and in-situ implies the necessity for timely object manipulation within a scene, where minor timing deviations can result in task failure. We formulate four game splits to scrutinize agents' learning and generalization of essential principles of interactive physical reasoning, fostering learning through interaction with representative scenarios. Our exploration involves three planning strategies and examines several supervised and reinforcement agents' zero-shot generalization proficiency on I-PHYRE. The outcomes highlight a notable gap between existing learning algorithms and human performance, emphasizing the imperative for more research in enhancing agents with interactive physical reasoning capabilities. The environment and baselines will be made publicly available.
comment: 21 pages, ICLR 2024
♻ ☆ OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models NAACL 2024
Object navigation (ObjectNav) requires an agent to navigate through unseen environments to find queried objects. Many previous methods attempted to solve this task by relying on supervised or reinforcement learning, where they are trained on limited household datasets with close-set objects. However, two key challenges are unsolved: understanding free-form natural language instructions that demand open-set objects, and generalizing to new environments in a zero-shot manner. Aiming to solve the two challenges, in this paper, we propose OpenFMNav, an Open-set Foundation Model based framework for zero-shot object Navigation. We first unleash the reasoning abilities of large language models (LLMs) to extract proposed objects from natural language instructions that meet the user's demand. We then leverage the generalizability of large vision language models (VLMs) to actively discover and detect candidate objects from the scene, building a Versatile Semantic Score Map (VSSM). Then, by conducting common sense reasoning on VSSM, our method can perform effective language-guided exploration and exploitation of the scene and finally reach the goal. By leveraging the reasoning and generalizing abilities of foundation models, our method can understand free-form human instructions and perform effective open-set zero-shot navigation in diverse environments. Extensive experiments on the HM3D ObjectNav benchmark show that our method surpasses all the strong baselines on all metrics, proving our method's effectiveness. Furthermore, we perform real robot demonstrations to validate our method's open-set-ness and generalizability to real-world environments.
comment: NAACL 2024 Findings
♻ ☆ Towards Massive Interaction with Generalist Robotics: A Systematic Review of XR-enabled Remote Human-Robot Interaction Systems
This survey provides an exhaustive review of the applications of extended reality (XR) technologies in the field of remote human-computer interaction (HRI). We developed a systematic search strategy based on the PRISMA methodology. From the initial 2,561 articles selected, 100 research papers that met our inclusion criteria were included. We categorized and summarized the domain in detail, delving into XR technologies, including augmented reality (AR), virtual reality (VR), and mixed reality (MR), and their applications in facilitating intuitive and effective remote control and interaction with robotic systems.The survey highlights existing articles on the application of XR technologies, user experience enhancement, and various interaction designs for XR in remote HRI, providing insights into current trends and future directions. We also identified potential gaps and opportunities for future research to improve remote HRI systems through XR technology to guide and inform future XR and robotics research.
♻ ☆ Directionality-Aware Mixture Model Parallel Sampling for Efficient Linear Parameter Varying Dynamical System Learning
The Linear Parameter Varying Dynamical System (LPV-DS) is an effective approach that learns stable, time-invariant motion policies using statistical modeling and semi-definite optimization to encode complex motions for reactive robot control. Despite its strengths, the LPV-DS learning approach faces challenges in achieving a high model accuracy without compromising the computational efficiency. To address this, we introduce the Directionality-Aware Mixture Model (DAMM), a novel statistical model that applies the Riemannian metric on the n-sphere $\mathbb{S}^n$ to efficiently blend non-Euclidean directional data with $\mathbb{R}^m$ Euclidean states. Additionally, we develop a hybrid Markov chain Monte Carlo technique that combines Gibbs Sampling with Split/Merge Proposal, allowing for parallel computation to drastically speed up inference. Our extensive empirical tests demonstrate that LPV-DS integrated with DAMM achieves higher reproduction accuracy, better model efficiency, and near real-time/online learning compared to standard estimation methods on various datasets. Lastly, we demonstrate its suitability for incrementally learning multi-behavior policies in real-world robot experiments.
♻ ☆ DTG : Diffusion-based Trajectory Generation for Mapless Global Navigation
We present a novel end-to-end diffusion-based trajectory generation method, DTG, for mapless global navigation in challenging outdoor scenarios with occlusions and unstructured off-road features like grass, buildings, bushes, etc. Given a distant goal, our approach computes a trajectory that satisfies the following goals: (1) minimize the travel distance to the goal; (2) maximize the traversability by choosing paths that do not lie in undesirable areas. Specifically, we present a novel Conditional RNN(CRNN) for diffusion models to efficiently generate trajectories. Furthermore, we propose an adaptive training method that ensures that the diffusion model generates more traversable trajectories. We evaluate our methods in various outdoor scenes and compare the performance with other global navigation algorithms on a Husky robot. In practice, we observe at least a 15% improvement in traveling distance and around a 7% improvement in traversability.
comment: 10 pages
♻ ☆ GelLink: A Compact Multi-phalanx Finger with Vision-based Tactile Sensing and Proprioception ICRA 2024
Compared to fully-actuated robotic end-effectors, underactuated ones are generally more adaptive, robust, and cost-effective. However, state estimation for underactuated hands is usually more challenging. Vision-based tactile sensors, like Gelsight, can mitigate this issue by providing high-resolution tactile sensing and accurate proprioceptive sensing. As such, we present GelLink, a compact, underactuated, linkage-driven robotic finger with low-cost, high-resolution vision-based tactile sensing and proprioceptive sensing capabilities. In order to reduce the amount of embedded hardware, i.e. the cameras and motors, we optimize the linkage transmission with a planar linkage mechanism simulator and develop a planar reflection simulator to simplify the tactile sensing hardware. As a result, GelLink only requires one motor to actuate the three phalanges, and one camera to capture tactile signals along the entire finger. Overall, GelLink is a compact robotic finger that shows adaptability and robustness when performing grasping tasks. The integration of vision-based tactile sensors can significantly enhance the capabilities of underactuated fingers and potentially broaden their future usage.
comment: Supplement video: https://www.youtube.com/watch?v=hZwUpAig5C0 . 7 pages, 9 figures. ICRA 2024 (IEEE International Conference on Robotics and Automation)
♻ ☆ Fast LiDAR Informed Visual Search in Unseen Indoor Environments
This paper explores the problem of planning for visual search without prior map information. We leverage the pixel-wise environment perception problem where one is given wide Field of View 2D scan data and must perform LiDAR segmentation to contextually label points in the surroundings. These pixel classifications provide an informed prior on which to plan next best viewpoints during visual search tasks. We present LIVES: LiDAR Informed Visual Search, a method aimed at finding objects of interest in unknown indoor environments. A robust map-free classifier is trained from expert data collected using a simple cart platform equipped with a map-based classifier. An autonomous exploration planner takes the contextual data from scans and uses that prior to plan viewpoints more likely to yield detection of the search target. We propose a utility function that accounts for traditional metrics like information gain and path cost and for the contextual information. LIVES is baselined against several existing exploration methods in simulation to verify its performance. It is validated in real-world experiments with single and multiple search objects with a Spot robot in two unseen environments. Videos of experiments, implementation details and open source code can be found at https://sites.google.com/view/lives-2024/home.
comment: 6 pages + references. 6 figures. 1 algorithm. 1 table
♻ ☆ Diagrammatic Instructions to Specify Spatial Objectives and Constraints with Applications to Mobile Base Placement
This paper introduces Spatial Diagrammatic Instructions (SDIs), an approach for human operators to specify objectives and constraints that are related to spatial regions in the working environment. Human operators are enabled to sketch out regions directly on camera images that correspond to the objectives and constraints. These sketches are projected to 3D spatial coordinates, and continuous Spatial Instruction Maps (SIMs) are learned upon them. These maps can then be integrated into optimization problems for tasks of robots. In particular, we demonstrate how Spatial Diagrammatic Instructions can be applied to solve the Base Placement Problem of mobile manipulators, which concerns the best place to put the manipulator to facilitate a certain task. Human operators can specify, via sketch, spatial regions of interest for a manipulation task and permissible regions for the mobile manipulator to be at. Then, an optimization problem that maximizes the manipulator's reachability, or coverage, over the designated regions of interest while remaining in the permissible regions is solved. We provide extensive empirical evaluations, and show that our formulation of Spatial Instruction Maps provides accurate representations of user-specified diagrammatic instructions. Furthermore, we demonstrate that our diagrammatic approach to the Mobile Base Placement Problem enables higher quality solutions and faster run-time.
♻ ☆ Greedy Perspectives: Multi-Drone View Planning for Collaborative Perception in Cluttered Environments IROS'24
Deployment of teams of aerial robots could enable large-scale filming of dynamic groups of people (actors) in complex environments for applications in areas such as team sports and cinematography. Toward this end, methods for submodular maximization via sequential greedy planning can be used for scalable optimization of camera views across teams of robots but face challenges with efficient coordination in cluttered environments. Obstacles can produce occlusions and increase chances of inter-robot collision which can violate requirements for near-optimality guarantees. To coordinate teams of aerial robots in filming groups of people in dense environments, a more general view-planning approach is required. We explore how collision and occlusion impact performance in filming applications through the development of a multi-robot multi-actor view planner with an occlusion-aware objective for filming groups of people and compare with a formation planner and a greedy planner that ignores inter-robot collisions. We evaluate our approach based on five test environments and complex multi-actor behaviors. Compared with a formation planner, our sequential planner generates 14% greater view reward over the actors for three scenarios and comparable performance to formation planning on two others. We also observe near identical view rewards for sequential planning both with and without inter-robot collision constraints which indicates that robots are able to avoid collisions without impairing performance in the perception task. Overall, we demonstrate effective coordination of teams of aerial robots for filming groups that may split, merge, or spread apart and in environments cluttered with obstacles that may cause collisions or occlusions.
comment: Submitted to IROS'24; 8 pages, 8 figures, 2 tables
♻ ☆ Optimizing Crowd-Aware Multi-Agent Path Finding through Local Broadcasting with Graph Neural Networks
Multi-Agent Path Finding (MAPF) in crowded environments presents a challenging problem in motion planning, aiming to find collision-free paths for all agents in the system. MAPF finds a wide range of applications in various domains, including aerial swarms, autonomous warehouse robotics, and self-driving vehicles. Current approaches to MAPF generally fall into two main categories: centralized and decentralized planning. Centralized planning suffers from the curse of dimensionality when the number of agents or states increases and thus does not scale well in large and complex environments. On the other hand, decentralized planning enables agents to engage in real-time path planning within a partially observable environment, demonstrating implicit coordination. However, they suffer from slow convergence and performance degradation in dense environments. In this paper, we introduce CRAMP, a novel crowd-aware decentralized reinforcement learning approach to address this problem by enabling efficient local communication among agents via Graph Neural Networks (GNNs), facilitating situational awareness and decision-making capabilities in congested environments. We test CRAMP on simulated environments and demonstrate that our method outperforms the state-of-the-art decentralized methods for MAPF on various metrics. CRAMP improves the solution quality up to 59% measured in makespan and collision count, and up to 35% improvement in success rate in comparison to previous methods.
comment: 8 pages, 5 figures, 2 tables
♻ ☆ SO(2)-Equivariant Downwash Models for Close Proximity Flight
Multirotors flying in close proximity induce aerodynamic wake effects on each other through propeller downwash. Conventional methods have fallen short of providing adequate 3D force-based models that can be incorporated into robust control paradigms for deploying dense formations. Thus, learning a model for these downwash patterns presents an attractive solution. In this paper, we present a novel learning-based approach for modelling the downwash forces that exploits the latent geometries (i.e. symmetries) present in the problem. We demonstrate that when trained with only 5 minutes of real-world flight data, our geometry-aware model outperforms state-of-the-art baseline models trained with more than 15 minutes of data. In dense real-world flights with two vehicles, deploying our model online improves 3D trajectory tracking by nearly 36% on average (and vertical tracking by 56%).
♻ ☆ Decision-Oriented Learning Using Differentiable Submodular Maximization for Multi-Robot Coordination
We present a differentiable, decision-oriented learning framework for cost prediction in a class of multi-robot decision-making problems, in which the robots need to trade off the task performance with the costs of taking actions when they select actions to take. Specifically, we consider the cases where the task performance is measured by a known monotone submodular function (e.g., coverage, mutual information), and the cost of actions depends on the context (e.g., wind and terrain conditions). We need to learn a function that maps the context to the costs. Classically, we treat such a learning problem and the downstream decision-making problem as two decoupled problems, i.e., we first learn to predict the cost function without considering the downstream decision-making problem, and then use the learned function for predicting the cost and using it in the decision-making problem. However, the loss function used in learning a prediction function may not be aligned with the downstream decision-making. We propose a decision-oriented learning framework that incorporates the downstream task performance in the prediction phase via a differentiable optimization layer. The main computational challenge in such a framework is to make the combinatorial optimization, i.e., non-monotone submodular maximization, differentiable. This function is not naturally differentiable. We propose the Differentiable Cost Scaled Greedy algorithm (D-CSG), which is a continuous and differentiable relaxation of CSG. We demonstrate the efficacy of the proposed framework through numerical simulations. The results show that the proposed framework can result in better performance than the traditional two-stage approach.
comment: arXiv admin note: text overlap with arXiv:2303.01543
♻ ☆ Pre-Trained Masked Image Model for Mobile Robot Navigation ICRA 2024
2D top-down maps are commonly used for the navigation and exploration of mobile robots through unknown areas. Typically, the robot builds the navigation maps incrementally from local observations using onboard sensors. Recent works have shown that predicting the structural patterns in the environment through learning-based approaches can greatly enhance task efficiency. While many such works build task-specific networks using limited datasets, we show that the existing foundational vision networks can accomplish the same without any fine-tuning. Specifically, we use Masked Autoencoders, pre-trained on street images, to present novel applications for field-of-view expansion, single-agent topological exploration, and multi-agent exploration for indoor mapping, across different input modalities. Our work motivates the use of foundational vision models for generalized structure prediction-driven applications, especially in the dearth of training data. For more qualitative results see https://raaslab.org/projects/MIM4Robots.
comment: Accepted at ICRA 2024
♻ ☆ C3D: Cascade Control with Change Point Detection and Deep Koopman Learning for Autonomous Surface Vehicles
In this paper, we discuss the development and deployment of a robust autonomous system capable of performing various tasks in the maritime domain under unknown dynamic conditions. We investigate a data-driven approach based on modular design for ease of transfer of autonomy across different maritime surface vessel platforms. The data-driven approach alleviates issues related to a priori identification of system models that may become deficient under evolving system behaviors or shifting, unanticipated, environmental influences. Our proposed learning-based platform comprises a deep Koopman system model and a change point detector that provides guidance on domain shifts prompting relearning under severe exogenous and endogenous perturbations. Motion control of the autonomous system is achieved via an optimal controller design. The Koopman linearized model naturally lends itself to a linear-quadratic regulator (LQR) control design. We propose the C3D control architecture Cascade Control with Change Point Detection and Deep Koopman Learning. The framework is verified in station keeping task on an ASV in both simulation and real experiments. The approach achieved at least 13.9 percent improvement in mean distance error in all test cases compared to the methods that do not consider system changes.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ On the Feedback Law in Stochastic Optimal Nonlinear Control
We consider the problem of nonlinear stochastic optimal control. This problem is thought to be fundamentally intractable owing to Bellman's ``curse of dimensionality". We present a result that shows that repeatedly solving an open-loop deterministic problem from the current state with progressively shorter horizons, similar to Model Predictive Control (MPC), results in a feedback policy that is $O(\epsilon^4)$ near to the true global stochastic optimal policy, \nxx{where $\epsilon$ is a perturbation parameter modulating the noise.} We show that the optimal deterministic feedback problem has a perturbation structure in that higher-order terms of the feedback law do not affect lower-order terms, and that this structure is lost in the optimal stochastic feedback problem. Consequently, solving the Stochastic Dynamic Programming problem is highly susceptible to noise, even when tractable, and in practice, the MPC-type feedback law offers superior performance even for stochastic systems.
comment: arXiv admin note: substantial text overlap with arXiv:2002.10505, arXiv:2002.09478
♻ ☆ Body-mounted MR-conditional Robot for Minimally Invasive Liver Intervention
MR-guided microwave ablation (MWA) has proven effective in treating hepatocellular carcinoma (HCC) with small-sized tumors, but the state-of-the-art technique suffers from sub-optimal workflow due to speed and accuracy of needle placement. This paper presents a compact body-mounted MR-conditional robot that can operate in closed-bore MR scanners for accurate needle guidance. The robotic platform consists of two stacked Cartesian XY stages, each with two degrees of freedom, that facilitate needle guidance. The robot is actuated using 3D-printed pneumatic turbines with MR-conditional bevel gear transmission systems. Pneumatic valves and control mechatronics are located inside the MRI control room and are connected to the robot with pneumatic transmission lines and optical fibers. Free space experiments indicated robot-assisted needle insertion error of 2.6$\pm$1.3 mm at an insertion depth of 80 mm. The MR-guided phantom studies were conducted to verify the MR-conditionality and targeting performance of the robot. Future work will focus on the system optimization and validations in animal trials.
comment: 10 figures
♻ ☆ Mind Meets Robots: A Review of EEG-Based Brain-Robot Interaction Systems
Brain-robot interaction (BRI) empowers individuals to control (semi-)automated machines through their brain activity, either passively or actively. In the past decade, BRI systems have achieved remarkable success, predominantly harnessing electroencephalogram (EEG) signals as the central component. This paper offers an up-to-date and exhaustive examination of 87 curated studies published during the last five years (2018-2023), focusing on identifying the research landscape of EEG-based BRI systems. This review aims to consolidate and underscore methodologies, interaction modes, application contexts, system evaluation, existing challenges, and potential avenues for future investigations in this domain. Based on our analysis, we present a BRI system model with three entities: Brain, Robot, and Interaction, depicting the internal relationships of a BRI system. We especially investigate the essence and principles on interaction modes between human brains and robots, a domain that has not yet been identified anywhere. We then discuss these entities with different dimensions encompassed. Within this model, we scrutinize and classify current research, reveal insights, specify challenges, and provide recommendations for future research trajectories in this field. Meanwhile, we envision our findings offer a design space for future human-robot interaction (HRI) research, informing the creation of efficient BRI frameworks.
♻ ☆ Reinforcement Learning with Options and State Representation
The current thesis aims to explore the reinforcement learning field and build on existing methods to produce improved ones to tackle the problem of learning in high-dimensional and complex environments. It addresses such goals by decomposing learning tasks in a hierarchical fashion known as Hierarchical Reinforcement Learning. We start in the first chapter by getting familiar with the Markov Decision Process framework and presenting some of its recent techniques that the following chapters use. We then proceed to build our Hierarchical Policy learning as an answer to the limitations of a single primitive policy. The hierarchy is composed of a manager agent at the top and employee agents at the lower level. In the last chapter, which is the core of this thesis, we attempt to learn lower-level elements of the hierarchy independently of the manager level in what is known as the "Eigenoption". Based on the graph structure of the environment, Eigenoptions allow us to build agents that are aware of the geometric and dynamic properties of the environment. Their decision-making has a special property: it is invariant to symmetric transformations of the environment, allowing as a consequence to greatly reduce the complexity of the learning task.
comment: Master Thesis 2018, MVA ENS Paris-Saclay, Tokyo RIKEN AIP
♻ ☆ Zero-BEV: Zero-shot Projection of Any First-Person Modality to BEV Maps
Bird's-eye view (BEV) maps are an important geometrically structured representation widely used in robotics, in particular self-driving vehicles and terrestrial robots. Existing algorithms either require depth information for the geometric projection, which is not always reliably available, or are trained end-to-end in a fully supervised way to map visual first-person observations to BEV representation, and are therefore restricted to the output modality they have been trained for. In contrast, we propose a new model capable of performing zero-shot projections of any modality available in a first person view to the corresponding BEV map. This is achieved by disentangling the geometric inverse perspective projection from the modality transformation, eg. RGB to occupancy. The method is general and we showcase experiments projecting to BEV three different modalities: semantic segmentation, motion vectors and object bounding boxes detected in first person. We experimentally show that the model outperforms competing methods, in particular the widely used baseline resorting to monocular depth estimation.
♻ ☆ Accelerating Model Predictive Control for Legged Robots through Distributed Optimization
This paper presents a novel approach to enhance Model Predictive Control (MPC) for legged robots through Distributed Optimization. Our method focuses on decomposing the robot dynamics into smaller, parallelizable subsystems, and utilizing the Alternating Direction Method of Multipliers (ADMM) to ensure consensus among them. Each subsystem is managed by its own Optimal Control Problem, with ADMM facilitating consistency between their optimizations. This approach not only decreases the computational time but also allows for effective scaling with more complex robot configurations, facilitating the integration of additional subsystems such as articulated arms on a quadruped robot. We demonstrate, through numerical evaluations, the convergence of our approach on two systems with increasing complexity. In addition, we showcase that our approach converges towards the same solution when compared to a state-of-the-art centralized whole-body MPC implementation. Moreover, we quantitatively compare the computational efficiency of our method to the centralized approach, revealing up to a 75\% reduction in computational time. Overall, our approach offers a promising avenue for accelerating MPC solutions for legged robots, paving the way for more effective utilization of the computational performance of modern hardware.
♻ ☆ Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers IJCNN 2024
3D human pose estimation captures the human joint points in three-dimensional space while keeping the depth information and physical structure. That is essential for applications that require precise pose information, such as human-computer interaction, scene understanding, and rehabilitation training. Due to the challenges in data collection, mainstream datasets of 3D human pose estimation are primarily composed of multi-view video data collected in laboratory environments, which contains rich spatial-temporal correlation information besides the image frame content. Given the remarkable self-attention mechanism of transformers, capable of capturing the spatial-temporal correlation from multi-view video datasets, we propose a multi-stage framework for 3D sequence-to-sequence (seq2seq) human pose detection. Firstly, the spatial module represents the human pose feature by intra-image content, while the frame-image relation module extracts temporal relationships and 3D spatial positional relationship features between the multi-perspective images. Secondly, the self-attention mechanism is adopted to eliminate the interference from non-human body parts and reduce computing resources. Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset. Experimental results demonstrate that our approach achieves state-of-the-art performance on this dataset. The source code will be available at https://github.com/WUJINHUAN/3D-human-pose.
comment: Accepted to IJCNN 2024. The source code will be available at https://github.com/WUJINHUAN/3D-human-pose
♻ ☆ Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback ICLR 2024
Reinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without the need for costly manual reward design by aligning human preferences. It is crucial to consider diverse human feedback types and various learning methods in different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow from real human feedback, fostering progress in the development of practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF develops a user-friendly annotation interface tailored to various feedback types, compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotations, resulting in large-scale annotated datasets comprising more than 15 million steps across 30+ popular tasks. Through extensive experiments, the results in the collected datasets demonstrate competitive performance compared to those from well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We wish to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at https://uni-rlhf.github.io/.
comment: Published as a conference paper at ICLR 2024. The website is available at https://uni-rlhf.github.io/
♻ ☆ Vehicle Trajectory Tracking Through Magnetic Sensors: A Case Study of Two-lane Road
Intelligent Transportation Systems (ITS) have a pressing need for efficient and reliable traffic surveillance solutions. This paper for the first time proposes a surveillance system that utilizes low-cost magnetic sensors for detecting and tracking vehicles continuously along the road. The system uses multiple sensors mounted along the roadside and lane boundaries to capture the movement of vehicles. Real-time measurement data is collected by base stations and processed to produce vehicle trajectories that include position, timestamp, and speed. To address the challenge of tracking vehicles continuously on a road network using a large amount of unlabeled magnetic sensor measurements, we first define a vehicle trajectory tracking problem. We then propose a graph-based data association algorithm to track each detected vehicle, and design a related online algorithm framework respectively. We finally validate the performance via both experimental simulation and real-world road deployment. The experimental results demonstrate that the proposed solution provides a cost-effective solution to capture the driving status of vehicles and on that basis form various traffic safety and efficiency applications.
Systems and Control
☆ A Branch and Bound method for the exact parameter identification of the PK/PD model for anesthetic drugs
We address the problem of parameter identification for the standard pharmacokinetic/pharmacodynamic (PK/PD) model for anesthetic drugs. Our main contribution is the development of a global optimization method that guarantees finding the parameters that minimize the one-step ahead prediction error. The method is based on a branch-and-bound algorithm, that can be applied to solve a more general class of nonlinear regression problems. We present some simulation results, based on a dataset of twelve patients. In these simulations, we are always able to identify the exact parameters, despite the non-convexity of the overall identification problem.
☆ Looking back and forward: A retrospective and future directions on Software Engineering for systems-of-systems
Modern systems are increasingly connected and more integrated with other existing systems, giving rise to systems-of-systems (SoS). An SoS consists of a set of independent, heterogeneous systems that interact to provide new functionalities and accomplish global missions through emergent behavior manifested at runtime. The distinctive characteristics of SoS, when contrasted to traditional systems, pose significant research challenges within Software Engineering. These challenges motivate the need for a paradigm shift and the exploration of novel approaches for designing, developing, deploying, and evolving these systems. The International Workshop on Software Engineering for Systems-of-Systems (SESoS) series started in 2013 to fill a gap in scientific forums addressing SoS from the Software Engineering perspective, becoming the first venue for this purpose. This article presents a study aimed at outlining the evolution and future trajectory of Software Engineering for SoS based on the examination of 57 papers spanning the 11 editions of the SESoS workshop (2013-2023). The study combined scoping review and scientometric analysis methods to categorize and analyze the research contributions concerning temporal and geographic distribution, topics of interest, research methodologies employed, application domains, and research impact. Based on such a comprehensive overview, this article discusses current and future directions in Software Engineering for SoS.
☆ A data-based comparison of methods for reducing the peak volume flow rate in a district heating system
This work concerns reduction of the peak flow rate of a district heating grid, a key system property which is bounded by pipe dimensions and pumping capacity. The peak flow rate constrains the number of additional consumers that can be connected, and may be a limiting factor in reducing supply temperatures when transitioning to the 4th generation of district heating. We evaluate a full year of operational data from a subset of customer meters in a district heating system in Germany. We consider the peak flow rate reduction that could be achieved with full a posteriori knowledge of this data. Three strategies for reducing the peak flow rate are investigated: A load shifting demand response strategy, an upper limitation in substation return temperatures, and an upper limitation on each substation's volume flow rate. We show that imposing up to to 18 % load flexibility for the customers provides an equal reduction in the peak system flow rate under the load shifting strategy. The limited return temperature strategy is less efficient at curtailing the peak flow rate, but provides an overall reduction of volume flow rates. Finally, the flow rate limitation method can introduce new, higher flow rate peaks, reducing performance.
☆ SIS epidemics on open networks: A replacement-based approximation
In this paper we analyze continuous-time SIS epidemics subject to arrivals and departures of agents, by using an approximated process based on replacements. In defining the SIS dynamics in an open network, we consider a stochastic setting in which arrivals and departures take place according to Poisson processes with similar rates, and the new value of the infection probability of an arriving agent is drawn from a continuous distribution. Since the system size changes with time, we define an approximated process, in which replacements take place instead of arrivals and departures, and we focus on the evolution of an aggregate measure of the level of infection. So long as the reproduction number is less than one, the long-term behavior of this function measures the impact of the changes of the set of agents in the epidemic. We derive upper bounds for the expectation and variance of this function and we include a numerical example to show that the approximated process is close to the original SIS process.
comment: 7 pages, 2 figures, to appear in European Control Conference (ECC 2024)
☆ Predictable Interval MDPs through Entropy Regularization
Regularization of control policies using entropy can be instrumental in adjusting predictability of real-world systems. Applications benefiting from such approaches range from, e.g., cybersecurity, which aims at maximal unpredictability, to human-robot interaction, where predictable behavior is highly desirable. In this paper, we consider entropy regularization for interval Markov decision processes (IMDPs). IMDPs are uncertain MDPs, where transition probabilities are only known to belong to intervals. Lately, IMDPs have gained significant popularity in the context of abstracting stochastic systems for control design. In this work, we address robust minimization of the linear combination of entropy and a standard cumulative cost in IMDPs, thereby establishing a trade-off between optimality and predictability. We show that optimal deterministic policies exist, and devise a value-iteration algorithm to compute them. The algorithm solves a number of convex programs at each step. Finally, through an illustrative example we show the benefits of penalizing entropy in IMDPs.
☆ BatDeck: Advancing Nano-drone Navigation with Low-power Ultrasound-based Obstacle Avoidance
Nano-drones, distinguished by their agility, minimal weight, and cost-effectiveness, are particularly well-suited for exploration in confined, cluttered and narrow spaces. Recognizing transparent, highly reflective or absorbing materials, such as glass and metallic surfaces is challenging, as classical sensors, such as cameras or laser rangers, often do not detect them. Inspired by bats, which can fly at high speeds in complete darkness with the help of ultrasound, this paper introduces \textit{BatDeck}, a pioneering sensor-deck employing a lightweight and low-power ultrasonic sensor for nano-drone autonomous navigation. This paper first provides insights about sensor characteristics, highlighting the influence of motor noise on the ultrasound readings, then it introduces the results of extensive experimental tests for obstacle avoidance (OA) in a diverse environment. Results show that \textit{BatDeck} allows exploration for a flight time of 8 minutes while covering 136m on average before crash in a challenging environment with transparent and reflective obstacles, proving the effectiveness of ultrasonic sensors for OA on nano-drones.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Trajectory Planning of Robotic Manipulator in Dynamic Environment Exploiting DRL
This study is about the implementation of a reinforcement learning algorithm in the trajectory planning of manipulators. We have a 7-DOF robotic arm to pick and place the randomly placed block at a random target point in an unknown environment. The obstacle is randomly moving which creates a hurdle in picking the object. The objective of the robot is to avoid the obstacle and pick the block with constraints to a fixed timestamp. In this literature, we have applied a deep deterministic policy gradient (DDPG) algorithm and compared the model's efficiency with dense and sparse rewards.
comment: Accepted in ICIESTR-2024
☆ Symbolic and User-friendly Geometric Algebra Routines (SUGAR) for Computations in Matlab
Geometric algebra (GA) is a mathematical tool for geometric computing, providing a framework that allows a unified and compact approach to geometric relations which in other mathematical systems are typically described using different more complicated elements. This fact has led to an increasing adoption of GA in applied mathematics and engineering problems. However, the scarcity of symbolic implementations of GA and its inherent complexity, requiring a specific mathematical background, make it challenging and less intuitive for engineers to work with. This prevents wider adoption among more applied professionals. To address this challenge, this paper introduces SUGAR (Symbolic and User-friendly Geometric Algebra Routines), an open-source toolbox designed for Matlab and licensed under the MIT License. SUGAR facilitates the translation of GA concepts into Matlab and provides a collection of user-friendly functions tailored for GA computations, including support for symbolic operations. It supports both numeric and symbolic computations in high-dimensional GAs. Specifically tailored for applied mathematics and engineering applications, SUGAR has been meticulously engineered to represent geometric elements and transformations within two and three-dimensional projective and conformal geometric algebras, aligning with established computational methodologies in the literature. Furthermore, SUGAR efficiently handles functions of multivectors, such as exponential, logarithmic, sinusoidal, and cosine functions, enhancing its applicability across various engineering domains, including robotics, control systems, and power electronics. Finally, this work includes four distinct validation examples, demonstrating SUGAR's capabilities across the above-mentioned fields and its practical utility in addressing real-world applied mathematics and engineering problems.
comment: 33 pages, 6 figures, journal paper submitted to ACM TOMS
☆ Guided Bayesian Optimization: Data-Efficient Controller Tuning with Digital Twin
This article presents the guided Bayesian optimization algorithm as an efficient data-driven method for iteratively tuning closed-loop controller parameters using an event-triggered digital twin of the system based on available closed-loop data. We define a controller tuning framework independent of the controller or the plant structure. Our proposed methodology is model-free, making it suitable for nonlinear and unmodelled plants with measurement noise. The objective function consists of performance metrics modeled by Gaussian processes. We utilize the available information in the closed-loop system to identify and progressively maintain a digital twin that guides the optimizer, improving the data efficiency of our method. Switching the digital twin on and off is triggered by data-driven criteria related to the digital twin's uncertainty estimations in the BO tuning framework. Effectively, it replaces much of the exploration of the real system with exploration performed on the digital twin. We analyze the properties of our method in simulation and demonstrate its performance on two real closed-loop systems with different plant and controller structures. The experimental results show that our method requires fewer experiments on the physical plant than Bayesian optimization to find the optimal controller parameters.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Optimal Operation of Reconfigurable Active Distribution Networks Aiming at Resiliency Improvement
As natural disasters bring about power outage and financial losses, network resiliency is an important challenge for distribution network operators (DNOs). On the other side, power loss reduction during normal operating condition is a major concern of DNOs. In this paper, optimal scheduling of active distribution network (ADN) is addressed through simultaneous minimization of power loss in normal condition and load shedding in critical condition after natural disasters. A new formulation is developed for the network reconfiguration to optimize the system operation in both normal and emergency conditions in the presence of conventional and renewable-energy-based distributed generation (DG) as well as energy storage systems (ESSs). The line flow based (LFB) algorithm is used for the AC power flow calculations, and all the developed relations have been convexified to construct a mixed-integer quadratically-constrained programming (MIQCP) optimization model. The simulations have been implemented on the IEEE 33-bus system in GAMS, and the results are investigated.
☆ Impact-Aware Bimanual Catching of Large-Momentum Objects
This paper investigates one of the most challenging tasks in dynamic manipulation -- catching large-momentum moving objects. Beyond the realm of quasi-static manipulation, dealing with highly dynamic objects can significantly improve the robot's capability of interacting with its surrounding environment. Yet, the inevitable motion mismatch between the fast moving object and the approaching robot will result in large impulsive forces, which lead to the unstable contacts and irreversible damage to both the object and the robot. To address the above problems, we propose an online optimization framework to: 1) estimate and predict the linear and angular motion of the object; 2) search and select the optimal contact locations across every surface of the object to mitigate impact through sequential quadratic programming (SQP); 3) simultaneously optimize the end-effector motion, stiffness, and contact force for both robots using multi-mode trajectory optimization (MMTO); and 4) realise the impact-aware catching motion on the compliant robotic system based on indirect force controller. We validate the impulse distribution, contact selection, and impact-aware MMTO algorithms in simulation and demonstrate the benefits of the proposed framework in real-world experiments including catching large-momentum moving objects with well-defined motion, constrained motion and free-flying motion.
☆ DASA: Delay-Adaptive Multi-Agent Stochastic Approximation
We consider a setting in which $N$ agents aim to speedup a common Stochastic Approximation (SA) problem by acting in parallel and communicating with a central server. We assume that the up-link transmissions to the server are subject to asynchronous and potentially unbounded time-varying delays. To mitigate the effect of delays and stragglers while reaping the benefits of distributed computation, we propose \texttt{DASA}, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation. We provide a finite-time analysis of \texttt{DASA} assuming that the agents' stochastic observation processes are independent Markov chains. Significantly advancing existing results, \texttt{DASA} is the first algorithm whose convergence rate depends only on the mixing time $\tmix$ and on the average delay $\tau_{avg}$ while jointly achieving an $N$-fold convergence speedup under Markovian sampling. Our work is relevant for various SA applications, including multi-agent and distributed temporal difference (TD) learning, Q-learning and stochastic optimization with correlated data.
☆ A Discrete-Time Least-Squares Adaptive State Tracking Control Scheme with A Mobile-Robot System Study
This paper develops an adaptive state tracking control scheme for discrete-time systems, using the least-squares algorithm, as the new solution to the long-standing discrete-time adaptive state tracking control problem to which the Lyapunov method (well-developed for the continuous-time adaptive state tracking problem) is not applicable. The new adaptive state tracking scheme is based on a recently-developed new discrete-time error model which has been used for gradient algorithm based state tracking control schemes, and uses the least-squares algorithm for parameter adaptation. The new least-squares algorithm is derived to minimize an accumulative estimation error, to ensure certain optimality for parameter estimation. The system stability and output tracking properties are studied. Technical results are presented in terms of plant-model matching, error model, adaptive law, optimality formulation, and stability and tracking analysis. The developed adaptive control scheme is applied to a discrete-time multiple mobile robot system to meet an adaptive state tracking objective. In addition, a collision avoidance mechanism is proposed to prevent collisions in the whole tracking process. Simulation results are presented, which verify the desired system state tracking properties under the developed least-squares algorithm based adaptive control scheme.
☆ Active Learning of Dynamics Using Prior Domain Knowledge in the Sampling Process
We present an active learning algorithm for learning dynamics that leverages side information by explicitly incorporating prior domain knowledge into the sampling process. Our proposed algorithm guides the exploration toward regions that demonstrate high empirical discrepancy between the observed data and an imperfect prior model of the dynamics derived from side information. Through numerical experiments, we demonstrate that this strategy explores regions of high discrepancy and accelerates learning while simultaneously reducing model uncertainty. We rigorously prove that our active learning algorithm yields a consistent estimate of the underlying dynamics by providing an explicit rate of convergence for the maximum predictive variance. We demonstrate the efficacy of our approach on an under-actuated pendulum system and on the half-cheetah MuJoCo environment.
☆ High-dimensional continuification control of large-scale multi-agent systems under limited sensing and perturbations
This paper investigates the robustness of a novel high-dimensional continuification control method for complex multi-agent systems. We begin by formulating a partial differential equation describing the spatio-temporal density dynamics of swarming agents. A stable control action for the density is then derived and validated under nominal conditions. Subsequently, we discretize this macroscopic strategy into actionable velocity inputs for the system's agents. Our analysis demonstrates the robustness of the approach beyond idealized assumptions of unlimited sensing and absence of perturbations.
comment: arXiv admin note: text overlap with arXiv:2310.01573
☆ Robust Finite-time Stabilization of Linear Systems with Limited State Quantization
This paper investigates the robust asymptotic stabilization of a linear time-invariant (LTI) system by a static feedback with a static state quantization. It is shown that the controllable LTI system can be stabilized to zero in a finite time by means of a nonlinear feedback with a quantizer having a limited (finite) number of values (quantization seeds) even when all parameters of the controller and the quantizer are time-invariant. The control design is based on generalized homogeneity. A homogeneous spherical quantizer is introduced. The static homogeneous feedback is shown to be local (or global) finite-time stabilizer for the linear system (dependently of the system matrix). The tuning rules for both the quantizer and the feedback law are obtained in the form of Linear Matrix Inequalities (LMIs). The closed-loop system is proven to be robust with respect to some bounded matched and vanishing mismatched perturbations. Theoretical results are supported by numerical simulations. \
☆ Belief Samples Are All You Need For Social Learning
In this paper, we consider the problem of social learning, where a group of agents embedded in a social network are interested in learning an underlying state of the world. Agents have incomplete, noisy, and heterogeneous sources of information, providing them with recurring private observations of the underlying state of the world. Agents can share their learning experience with their peers by taking actions observable to them, with values from a finite feasible set of states. Actions can be interpreted as samples from the beliefs which agents may form and update on what the true state of the world is. Sharing samples, in place of full beliefs, is motivated by the limited communication, cognitive, and information-processing resources available to agents especially in large populations. Previous work (Salhab et al.) poses the question as to whether learning with probability one is still achievable if agents are only allowed to communicate samples from their beliefs. We provide a definite positive answer to this question, assuming a strongly connected network and a ``collective distinguishability'' assumption, which are both required for learning even in full-belief-sharing settings. In our proposed belief update mechanism, each agent's belief is a normalized weighted geometric interpolation between a fully Bayesian private belief -- aggregating information from the private source -- and an ensemble of empirical distributions of the samples shared by her neighbors over time. By carefully constructing asymptotic almost-sure lower/upper bounds on the frequency of shared samples matching the true state/or not, we rigorously prove the convergence of all the beliefs to the true state, with probability one.
comment: 6 pages
☆ Output-feedback Synthesis Orbit Geometry: Quotient Manifolds and LQG Direct Policy Optimization
In this paper, we consider direct policy optimization for the linear-quadratic Gaussian (LQG) setting. Over the past few years, it has been recognized that the landscape of stabilizing output-feedback controllers of relevance to LQG has an intricate geometry, particularly as it pertains to the existence of spurious stationary points. In order to address such challenges, in this paper, we first adopt a Riemannian metric for the space of stabilizing full-order minimal output-feedback controllers. We then proceed to prove that the orbit of such controllers modulo coordinate transformation admits a Riemannian quotient manifold structure. This geometric structure is then used to develop a Riemannian gradient descent for the direct LQG policy optimization. We prove a local convergence guarantee with linear rate and show the proposed approach exhibits significantly faster and more robust numerical performance as compared with ordinary gradient descent for LQG. Subsequently, we provide reasons for this observed behavior; in particular, we argue that optimizing over the orbit space of controllers is the right theoretical and computational setup for direct LQG policy optimization.
☆ Approximation with Random Shallow ReLU Networks with Applications to Model Reference Adaptive Control
Neural networks are regularly employed in adaptive control of nonlinear systems and related methods o reinforcement learning. A common architecture uses a neural network with a single hidden layer (i.e. a shallow network), in which the weights and biases are fixed in advance and only the output layer is trained. While classical results show that there exist neural networks of this type that can approximate arbitrary continuous functions over bounded regions, they are non-constructive, and the networks used in practice have no approximation guarantees. Thus, the approximation properties required for control with neural networks are assumed, rather than proved. In this paper, we aim to fill this gap by showing that for sufficiently smooth functions, ReLU networks with randomly generated weights and biases achieve $L_{\infty}$ error of $O(m^{-1/2})$ with high probability, where $m$ is the number of neurons. It suffices to generate the weights uniformly over a sphere and the biases uniformly over an interval. We show how the result can be used to get approximations of required accuracy in a model reference adaptive control application.
comment: Under Review for Conference on Decision and Control
☆ Adaptive Step Duration for Precise Foot Placement: Achieving Robust Bipedal Locomotion on Terrains with Restricted Footholds
This paper introduces a novel multi-step preview foot placement planning algorithm designed to enhance the robustness of bipedal robotic walking across challenging terrains with restricted footholds. Traditional one-step preview planning struggles to maintain stability when stepping areas are severely limited, such as with random stepping stones. In this work, we developed a discrete-time Model Predictive Control (MPC) based on the step-to-step discrete evolution of the Divergent Component of Motion (DCM) of bipedal locomotion. This approach adaptively changes the step duration for optimal foot placement under constraints, thereby ensuring the robot's operational viability over multiple future steps and significantly improving its ability to navigate through environments with tight constraints on possible footholds. The effectiveness of this planning algorithm is demonstrated through simulations that include a variety of complex stepping-stone configurations and external perturbations. These tests underscore the algorithm's improved performance for navigating foothold-restricted environments, even with the presence of external disturbances.
comment: 8 pages, 8 figures, submitted to CDC 2024, for associated simulation video, see https://youtu.be/2jhikPlZmbE
☆ State-Augmented Linear Games with Antagonistic Error for High-Dimensional, Nonlinear Hamilton-Jacobi Reachability
Hamilton-Jacobi Reachability (HJR) is a popular method for analyzing the liveness and safety of a dynamical system with bounded control and disturbance. The corresponding HJ value function offers a robust controller and characterizes the reachable sets, but is traditionally solved with Dynamic Programming (DP) and limited to systems of dimension less than six. Recently, the space-parallelizeable, generalized Hopf formula has been shown to also solve the HJ value with a nearly three-log increase in dimension limit, but is limited to linear systems. To extend this potential, we demonstrate how state-augmented (SA) spaces, which are well-known for their improved linearization accuracy, may be used to solve tighter, conservative approximations of the value function with any linear model in this SA space. Namely, we show that with a representation of the true dynamics in the SA space, a series of inequalities confirms that the value of a SA linear game with antagonistic error is a conservative envelope of the true value function. It follows that if the optimal controller for the HJ SA linear game with error may succeed, it will also succeed in the true system. Unlike previous methods, this result offers the ability to safely approximate reachable sets and their corresponding controllers with the Hopf formula in a non-convex manner. Finally, we demonstrate this in the slow manifold system for clarity, and in the controlled Van der Pol system with different lifting functions.
☆ An Optimal Solution to Infinite Horizon Nonlinear Control Problems: Part II
This paper considers the infinite horizon optimal control problem for nonlinear systems. Under the condition of nonlinear controllability of the system to any terminal set containing the origin and forward invariance of the terminal set, we establish a regularized solution approach consisting of a ``finite free final time" optimal transfer problem to the terminal set which renders the set globally asymptotically stable. Further, we show that the approximations converge to the optimal infinite horizon cost as the size of the terminal set decreases to zero. We also perform the analysis for the discounted problem and show that the terminal set is asymptotically stable only for a subset of the state space and not globally. The theory is empirically evaluated on various nonholonomic robotic systems to show that the cost of our approximate problem converges and the transfer time into the terminal set is dependent on the initial state of the system, necessitating the free final time formulation.
☆ Bayesian Methods for Trust in Collaborative Multi-Agent Autonomy
Multi-agent, collaborative sensor fusion is a vital component of a multi-national intelligence toolkit. In safety-critical and/or contested environments, adversaries may infiltrate and compromise a number of agents. We analyze state of the art multi-target tracking algorithms under this compromised agent threat model. We prove that the track existence probability test ("track score") is significantly vulnerable to even small numbers of adversaries. To add security awareness, we design a trust estimation framework using hierarchical Bayesian updating. Our framework builds beliefs of trust on tracks and agents by mapping sensor measurements to trust pseudomeasurements (PSMs) and incorporating prior trust beliefs in a Bayesian context. In case studies, our trust estimation algorithm accurately estimates the trustworthiness of tracks/agents, subject to observability limitations.
☆ Spline Trajectory Tracking and Obstacle Avoidance for Mobile Agents via Convex Optimization
We propose an output feedback control-based motion planning technique for agents to enable them to converge to a specified polynomial trajectory while imposing a set of safety constraints on our controller to avoid collisions within the free configuration space (polygonal environment). To achieve this, we 1) decompose our polygonal environment into different overlapping cells 2) write out our polynomial trajectories as the output of a reference dynamical system with given initial conditions 3) formulate convergence and safety constraints as Linear Matrix Inequalities (LMIs) on our controller using Control Lyapunov Functions (CLFs) and Control Barrier Functions (CBFs) and 4) solve a semi-definite programming (SDP) problem with convergence and safety constraints imposed to synthesize a controller for each convex cell. Extensive simulations are included to test our motion planning method under different initial conditions and different reference trajectories. The synthesized controller is robust to changes in initial conditions and is always safe relative to the boundaries of the polygonal environment.
☆ State Space Models as Foundation Models: A Control Theoretic Overview
In recent years, there has been a growing interest in integrating linear state-space models (SSM) in deep neural network architectures of foundation models. This is exemplified by the recent success of Mamba, showing better performance than the state-of-the-art Transformer architectures in language tasks. Foundation models, like e.g. GPT-4, aim to encode sequential data into a latent space in order to learn a compressed representation of the data. The same goal has been pursued by control theorists using SSMs to efficiently model dynamical systems. Therefore, SSMs can be naturally connected to deep sequence modeling, offering the opportunity to create synergies between the corresponding research areas. This paper is intended as a gentle introduction to SSM-based architectures for control theorists and summarizes the latest research developments. It provides a systematic review of the most successful SSM proposals and highlights their main features from a control theoretic perspective. Additionally, we present a comparative analysis of these models, evaluating their performance on a standardized benchmark designed for assessing a model's efficiency at learning long sequences.
☆ A Semi-Lagrangian Approach for Time and Energy Path Planning Optimization in Static Flow Fields
Efficient path planning for autonomous mobile robots is a critical problem across numerous domains, where optimizing both time and energy consumption is paramount. This paper introduces a novel methodology that considers the dynamic influence of an environmental flow field and considers geometric constraints, including obstacles and forbidden zones, enriching the complexity of the planning problem. We formulate it as a multi-objective optimal control problem, propose a novel transformation called Harmonic Transformation, and apply a semi-Lagrangian scheme to solve it. The set of Pareto efficient solutions is obtained considering two distinct approaches: a deterministic method and an evolutionary-based one, both of which are designed to make use of the proposed Harmonic Transformation. Through an extensive analysis of these approaches, we demonstrate their efficacy in finding optimized paths.
comment: 12 pages, initial paper submission; Preprint submitted to the IEEE Transactions on Intelligent Transportation Systems
☆ Semantic-Aware Remote Estimation of Multiple Markov Sources Under Constraints
This paper studies semantic-aware communication for remote estimation of multiple Markov sources over a lossy and rate-constrained channel. Unlike most existing studies that treat all source states equally, we exploit the semantics of information and consider that the remote actuator has different tolerances for the estimation errors of different states. We aim to find an optimal scheduling policy that minimizes the long-term state-dependent costs of estimation errors under a transmission frequency constraint. We theoretically show the structure of the optimal policy by leveraging the average-cost Constrained Markov Decision Process (CMDP) theory and the Lagrangian dynamic programming. By exploiting the optimal structural results, we develop a novel policy search algorithm, termed intersection search plus relative value iteration (Insec-RVI), that can find the optimal policy using only a few iterations. To avoid the ``curse of dimensionality'' of MDPs, we propose an online low-complexity drift-plus-penalty (DPP) scheduling algorithm based on the Lyapunov optimization theorem. We also design an efficient average-cost Q-learning algorithm to estimate the optimal policy without knowing a priori the channel and source statistics. Numerical results show that continuous transmission is inefficient, and remarkably, our semantic-aware policies can attain the optimum by strategically utilizing fewer transmissions by exploiting the timing of the important information.
☆ Energy Efficiency Optimization Method of WDM Visible Light Communication System for Indoor Broadcasting Networks
This paper introduces a novel approach to optimize energy efficiency in wavelength division multiplexing (WDM) Visible Light Communication (VLC) systems designed for indoor broadcasting networks. A physics-based LED model is integrated into system energy efficiency optimization, enabling quantitative analysis of the critical issue of VLC energy efficiency: the nonlinear interplay between illumination and communication performance. The optimization jointly incorporates constraints on communication quality of each channel, and illumination performance, standardized by the International Commission on Illumination (CIE). The formulated nonlinear optimization problem is solved by the Sequential Quadratic Programming (SQP) algorithm in an experiment-based simulation. An integrated Red-Green-Blue-Yellow Light Emitting Diode (RGBY-LED) is measured for model calibration and three different scenarios are simulated to evaluate the generality of the proposed method. Results demonstrate a double enhancement in performance and a high versatility in accommodating various scenarios. Furthermore, it highlights the importance of balancing communication and illumination imperatives in VLC systems, challenging conventional perceptions focused solely on minimizing power consumption.
☆ Resource and Mobility Management in Hybrid LiFi and WiFi Networks: A User-Centric Learning Approach
Hybrid light fidelity (LiFi) and wireless fidelity (WiFi) networks (HLWNets) are an emerging indoor wireless communication paradigm, which combines the advantages of the capacious optical spectra of LiFi and ubiquitous coverage of WiFi. Meanwhile, load balancing (LB) becomes a key challenge in resource management for such hybrid networks. The existing LB methods are mostly network-centric, relying on a central unit to make a solution for the users all at once. Consequently, the solution needs to be updated for all users at the same pace, regardless of their moving status. This would affect the network performance in two aspects: i) when the update frequency is low, it would compromise the connectivity of fast-moving users; ii) when the update frequency is high, it would cause unnecessary handovers as well as hefty feedback costs for slow-moving users. Motivated by this, we investigate user-centric LB which allows users to update their solutions at different paces. The research is developed upon our previous work on adaptive target-condition neural network (ATCNN), which can conduct LB for individual users in quasi-static channels. In this paper, a deep neural network (DNN) model is designed to enable an adaptive update interval for each individual user. This new model is termed as mobility-supporting neural network (MSNN). Associating MSNN with ATCNN, a user-centric LB framework named mobility-supporting ATCNN (MS-ATCNN) is proposed to handle resource management and mobility management simultaneously. Results show that at the same level of average update interval, MS-ATCNN can achieve a network throughput up to 215\% higher than conventional LB methods such as game theory, especially for a larger number of users. In addition, MS-ATCNN costs an ultra low runtime at the level of 100s $\mu$s, which is two to three orders of magnitude lower than game theory.
comment: 12 pages, 12 figures, 3 tables, submitted to IEEE TWC
☆ Scheduling Power-Intensive Operations of Battery Energy Storage Systems and Application to Hybrid Hydropower Plants
Classical schedulers for Battery Energy Storage Systems (BESSs) use static power constraints, assuming that the BESS can provide the rated power at any State-Of-Charge (SOC) level and that these are representative of the underlying physical constraints of the system (BESS voltage and current). Static power constraints, however, can generate unfeasible schedules, especially in power-intensive applications, as demonstrated in this paper. This paper derives a set of alternative constraints for the BESS power that are cognizant of the physical limits of the BESS. It is shown that these constraints, developed by leveraging an equivalent circuit model of the BESS, can be formulated as linear inequalities in scheduling problems, thus leaving the properties of the original problem (i.e., linearity and convexity) unaltered. A comparative analysis against traditional schedulers from the literature shows significant reductions in current violations and the generation of feasible schedules. These findings underscore the crucial role of implementing more advanced power constraints of BESSs in power-intensive applications, thereby enhancing the reliability of BESS scheduling strategies.
☆ An LLM-Based Digital Twin for Optimizing Human-in-the Loop Systems
The increasing prevalence of Cyber-Physical Systems and the Internet of Things (CPS-IoT) applications and Foundation Models are enabling new applications that leverage real-time control of the environment. For example, real-time control of Heating, Ventilation and Air-Conditioning (HVAC) systems can reduce its usage when not needed for the comfort of human occupants, hence reducing energy consumption. Collecting real-time feedback on human preferences in such human-in-the-loop (HITL) systems, however, is difficult in practice. We propose the use of large language models (LLMs) to deal with the challenges of dynamic environments and difficult-to-obtain data in CPS optimization. In this paper, we present a case study that employs LLM agents to mimic the behaviors and thermal preferences of various population groups (e.g. young families, the elderly) in a shopping mall. The aggregated thermal preferences are integrated into an agent-in-the-loop based reinforcement learning algorithm AitL-RL, which employs the LLM as a dynamic simulation of the physical environment to learn how to balance between energy savings and occupant comfort. Our results show that LLMs are capable of simulating complex population movements within large open spaces. Besides, AitL-RL demonstrates superior performance compared to the popular existing policy of set point control, suggesting that adaptive and personalized decision-making is critical for efficient optimization in CPS-IoT applications. Through this case study, we demonstrate the potential of integrating advanced Foundation Models like LLMs into CPS-IoT to enhance system adaptability and efficiency. The project's code can be found on our GitHub repository.
comment: Accepted at International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things (FMSys) 2024, Co-located at CPS-IoT Week 2024
☆ Privacy Preservation by Intermittent Transmission in Cooperative LQG Control Systems
In this paper, we study a cooperative linear quadratic Gaussian (LQG) control system with a single user and a server. In this system, the user runs a process and employs the server to meet the needs of computation. However, the user regards its state trajectories as privacy. Therefore, we propose a privacy scheme, in which the user sends data to the server intermittently. By this scheme, the server's received information of the user is reduced, and consequently the user's privacy is preserved. In this paper, we consider a periodic transmission scheme. We analyze the performance of privacy preservation and LQG control of different transmission periods. Under the given threshold of the control performance loss, a trade-off optimization problem is proposed. Finally, we give the solution to the optimization problem.
☆ DBPF: A Framework for Efficient and Robust Dynamic Bin-Picking
Efficiency and reliability are critical in robotic bin-picking as they directly impact the productivity of automated industrial processes. However, traditional approaches, demanding static objects and fixed collisions, lead to deployment limitations, operational inefficiencies, and process unreliability. This paper introduces a Dynamic Bin-Picking Framework (DBPF) that challenges traditional static assumptions. The DBPF endows the robot with the reactivity to pick multiple moving arbitrary objects while avoiding dynamic obstacles, such as the moving bin. Combined with scene-level pose generation, the proposed pose selection metric leverages the Tendency-Aware Manipulability Network optimizing suction pose determination. Heuristic task-specific designs like velocity-matching, dynamic obstacle avoidance, and the resight policy, enhance the picking success rate and reliability. Empirical experiments demonstrate the importance of these components. Our method achieves an average 84% success rate, surpassing the 60% of the most comparable baseline, crucially, with zero collisions. Further evaluations under diverse dynamic scenarios showcase DBPF's robust performance in dynamic bin-picking. Results suggest that our framework offers a promising solution for efficient and reliable robotic bin-picking under dynamics.
comment: 8 pages, 5 figures. This paper has been accepted by IEEE RA-L on 2024-03-24. See the supplementary video at youtube: https://youtu.be/n5af2VsKhkg
☆ Policy Gradient-based Model Free Optimal LQG Control with a Probabilistic Risk Constraint
In this paper, we investigate a model-free optimal control design that minimizes an infinite horizon average expected quadratic cost of states and control actions subject to a probabilistic risk or chance constraint using input-output data. In particular, we consider linear time-invariant systems and design an optimal controller within the class of linear state feedback control. Three different policy gradient (PG) based algorithms, natural policy gradient (NPG), Gauss-Newton policy gradient (GNPG), and deep deterministic policy gradient (DDPG), are developed, and compared with the optimal risk-neutral linear-quadratic regulator (LQR) and a scenario-based model predictive control (MPC) technique via numerical simulations. The convergence properties and the accuracy of all the algorithms are compared numerically. We also establish analytical convergence properties of the NPG and GNPG algorithms under the known model scenario, while the proof of convergence for the unknown model scenario is part of our ongoing work.
comment: Submitted to IEEE CDC2024
☆ A Blotto Game Approach to Ride-hailing Markets with Electric Vehicles
When a centrally operated ride-hailing company considers to enter a market already served by another company, it has to make a strategic decision about how to distribute its fleet among different regions in the area. This decision will be influenced by the market share the company can secure and the costs associated with charging the vehicles in each region, all while competing with the company already operating in the area. In this paper, we propose a Colonel Blotto-like game to model this decision-making. For the class of games that we study, we first prove the existence and uniqueness of a Nash Equilibrium. Subsequently, we provide its general characterization and present an algorithm for computing the ones in the feasible set's interior. Additionally, for a simplified scenario involving two regions, which would correspond to a city area with a downtown and a suburban region, we also provide a method to check for the equilibria on the feasible set's boundary. Finally, through a numerical case study, we illustrate the impact of charging prices on the position of the Nash equilibrium.
comment: Extended version of the paper accepted for presentation at the 2024 European Control Conference (ECC2024)
☆ The Adaptive Workplace: Orchestrating Architectural Services around the Wellbeing of Individual Occupants
As the academic consortia members of the EU Horizon project SONATA ("Situation-aware OrchestratioN of AdapTive Architecture"), we respond to the workshop call for "Office Wellbeing by Design: Don't Stand for Anything Less" by proposing the "Adaptive Workplace" concept. In essence, our vision aims to adapt a workplace to the ever-changing needs of individual occupants, instead of that occupants are expected to adapt to their workplace.
☆ Counter-example guided Imitation Learning of Feedback Controllers from Temporal Logic Specifications
We present a novel method for imitation learning for control requirements expressed using Signal Temporal Logic (STL). More concretely we focus on the problem of training a neural network to imitate a complex controller. The learning process is guided by efficient data aggregation based on counter-examples and a coverage measure. Moreover, we introduce a method to evaluate the performance of the learned controller via parameterization and parameter estimation of the STL requirements. We demonstrate our approach with a flying robot case study.
☆ Sparsity-Constrained Linear Quadratic Regulation Problem: Greedy Approach with Performance Guarantee
We study a linear quadratic regulation problem with a constraint where the control input can be nonzero only at a limited number of times. Given that this constraint leads to a combinational optimization problem, we adopt a greedy method to find a suboptimal solution. To quantify the performance of the greedy algorithm, we employ two metrics that reflect the submodularity level of the objective function: The submodularity ratio and curvature. We first present an explicit form of the optimal control input that is amenable to evaluating these metrics. Subsequently, we establish bounds on the submodularity ratio and curvature, which enable us to offer a practical performance guarantee for the greedy algorithm. The effectiveness of our guarantee is further demonstrated through numerical simulations.
comment: 8 pages, 3 figures
☆ Decoupling parameter variation from noise: Biquadratic Lyapunov forms in data-driven LPV control
A promising step from linear towards nonlinear data-driven control is via the design of controllers for linear parameter-varying (LPV) systems, which are linear systems whose parameters are varying along a measurable scheduling signal. However, the interplay between uncertainty arising from corrupted data and the parameter-varying nature of these systems impacts the stability analysis, and limits the generalization of well-understood data-driven methods for linear time-invariant systems. In this work, we decouple this interplay using a recently developed variant of the Fundamental Lemma for LPV systems and the viewpoint of data-informativity, in combination with biquadratic Lyapunov forms. Together, these allow us to develop novel linear matrix inequality conditions for the existence of scheduling-dependent Lyapunov functions, incorporating the intrinsic nonlinearity. Appealingly, these results are stated purely in terms of the collected data and bounds on the noise, and they are computationally favorable to check.
comment: Submitted for CDC 2024
☆ Hybrid low-dimensional limiting state of charge estimator for multi-cell lithium-ion batteries
The state of charge (SOC) of lithium-ion batteries needs to be accurately estimated for safety and reliability purposes. For battery packs made of a large number of cells, it is not always feasible to design one SOC estimator per cell due to limited computational resources. Instead, only the minimum and the maximum SOC need to be estimated. The challenge is that the cells having minimum and maximum SOC typically change over time. In this context, we present a low-dimensional hybrid estimator of the minimum (maximum) SOC, whose convergence is analytically guaranteed. We consider for this purpose a battery consisting of cells interconnected in series, which we model by electric equivalent circuit models. We then present the hybrid estimator, which runs an observer designed for a single cell at any time instant, selected by a switching-like logic mechanism. We establish a practical exponential stability property for the estimation error on the minimum (maximum) SOC thereby guaranteeing the ability of the hybrid scheme to generate accurate estimates of the minimum (maximum) SOC. The analysis relies on non-smooth hybrid Lyapunov techniques. A numerical illustration is provided to showcase the relevance of the proposed approach.
☆ Spatially temporally distributed informative path planning for multi-robot systems
This paper investigates the problem of informative path planning for a mobile robotic sensor network in spatially temporally distributed mapping. The robots are able to gather noisy measurements from an area of interest during their movements to build a Gaussian Process (GP) model of a spatio-temporal field. The model is then utilized to predict the spatio-temporal phenomenon at different points of interest. To spatially and temporally navigate the group of robots so that they can optimally acquire maximal information gains while their connectivity is preserved, we propose a novel multistep prediction informative path planning optimization strategy employing our newly defined local cost functions. By using the dual decomposition method, it is feasible and practical to effectively solve the optimization problem in a distributed manner. The proposed method was validated through synthetic experiments utilizing real-world data sets.
☆ Data-Driven Extrusion Force Control Tuning for 3D Printing
The quality of 3D prints often varies due to different conditions inherent to each print, such as filament type, print speed, and nozzle size. Closed-loop process control methods improve the accuracy and repeatability of 3D prints. However, optimal tuning of controllers for given process parameters and design geometry is often a challenge with manually tuned controllers resulting in inconsistent and suboptimal results. This work employs Bayesian optimization to identify the optimal controller parameters. Additionally, we explore transfer learning in the context of 3D printing by leveraging prior information from past trials. By integrating optimized extrusion force control and transfer learning, we provide a novel framework for closed-loop 3D printing and propose an automated calibration routine that produces high-quality prints for a desired combination of print settings, material, and shape.
comment: Submitted to IEEE CASE 2024
☆ A Geometric Perspective on Fusing Gaussian Distributions on Lie Groups
Stochastic inference on Lie groups plays a key role in state estimation problems, such as inertial navigation, visual inertial odometry, pose estimation in virtual reality, etc. A key problem is fusing independent concentrated Gaussian distributions defined at different reference points on the group. In this paper we approximate distributions at different points in the group in a single set of exponential coordinates and then use classical Gaussian fusion to obtain the fused posteriori in those coordinates. We consider several approximations including the exact Jacobian of the change of coordinate map, first and second order Taylor's expansions of the Jacobian, and parallel transport with and without curvature correction associated with the underlying geometry of the Lie group. Preliminary results on SO(3) demonstrate that a novel approximation using parallel transport with curvature correction achieves similar accuracy to the state-of-the-art optimisation based algorithms at a fraction of the computational cost.
comment: Preprint for L-CSS
☆ A Distributionally Robust Model Predictive Control for Static and Dynamic Uncertainties in Smart Grids
The integration of various power sources, including renewables and electric vehicles, into smart grids is expanding, introducing uncertainties that can result in issues like voltage imbalances, load fluctuations, and power losses. These challenges negatively impact the reliability and stability of online scheduling in smart grids. Existing research often addresses uncertainties affecting current states but overlooks those that impact future states, such as the unpredictable charging patterns of electric vehicles. To distinguish between these, we term them static uncertainties and dynamic uncertainties, respectively. This paper introduces WDR-MPC, a novel approach that stands for two-stage Wasserstein-based Distributionally Robust (WDR) optimization within a Model Predictive Control (MPC) framework, aimed at effectively managing both types of uncertainties in smart grids. The dynamic uncertainties are first reformulated into ambiguity tubes and then the distributionally robust bounds of both dynamic and static uncertainties can be established using WDR optimization. By employing ambiguity tubes and WDR optimization, the stochastic MPC system is converted into a nominal one. Moreover, we develop a convex reformulation method to speed up WDR computation during the two-stage optimization. The distinctive contribution of this paper lies in its holistic approach to both static and dynamic uncertainties in smart grids. Comprehensive experiment results on IEEE 38-bus and 94-bus systems reveal the method's superior performance and the potential to enhance grid stability and reliability.
☆ Physics-informed RL for Maximal Safety Probability Estimation
Accurate risk quantification and reachability analysis are crucial for safe control and learning, but sampling from rare events, risky states, or long-term trajectories can be prohibitively costly. Motivated by this, we study how to estimate the long-term safety probability of maximally safe actions without sufficient coverage of samples from risky states and long-term trajectories. The use of maximal safety probability in control and learning is expected to avoid conservative behaviors due to over-approximation of risk. Here, we first show that long-term safety probability, which is multiplicative in time, can be converted into additive costs and be solved using standard reinforcement learning methods. We then derive this probability as solutions of partial differential equations (PDEs) and propose Physics-Informed Reinforcement Learning (PIRL) algorithm. The proposed method can learn using sparse rewards because the physics constraints help propagate risk information through neighbors. This suggests that, for the purpose of extracting more information for efficient learning, physics constraints can serve as an alternative to reward shaping. The proposed method can also estimate long-term risk using short-term samples and deduce the risk of unsampled states. This feature is in stark contrast with the unconstrained deep RL that demands sufficient data coverage. These merits of the proposed method are demonstrated in numerical simulation.
☆ Real-time Adaptation for Condition Monitoring Signal Prediction using Label-aware Neural Processes
Building a predictive model that rapidly adapts to real-time condition monitoring (CM) signals is critical for engineering systems/units. Unfortunately, many current methods suffer from a trade-off between representation power and agility in online settings. For instance, parametric methods that assume an underlying functional form for CM signals facilitate efficient online prediction updates. However, this simplification leads to vulnerability to model specifications and an inability to capture complex signals. On the other hand, approaches based on over-parameterized or non-parametric models can excel at explaining complex nonlinear signals, but real-time updates for such models pose a challenging task. In this paper, we propose a neural process-based approach that addresses this trade-off. It encodes available observations within a CM signal into a representation space and then reconstructs the signal's history and evolution for prediction. Once trained, the model can encode an arbitrary number of observations without requiring retraining, enabling on-the-spot real-time predictions along with quantified uncertainty and can be readily updated as more online data is gathered. Furthermore, our model is designed to incorporate qualitative information (i.e., labels) from individual units. This integration not only enhances individualized predictions for each unit but also enables joint inference for both signals and their associated labels. Numerical studies on both synthetic and real-world data in reliability engineering highlight the advantageous features of our model in real-time adaptation, enhanced signal prediction with uncertainty quantification, and joint prediction for labels and signals.
♻ ☆ Peak Estimation of Rational Systems using Convex Optimization
This paper presents algorithms that upper-bound the peak value of a state function along trajectories of a continuous-time system with rational dynamics. The finite-dimensional but nonconvex peak estimation problem is cast as a convex infinite-dimensional linear program in occupation measures. This infinite-dimensional program is then truncated into finite-dimensions using the moment-Sum-of-Squares (SOS) hierarchy of semidefinite programs. Prior work on treating rational dynamics using the moment-SOS approach involves clearing dynamics to common denominators or adding lifting variables to handle reciprocal terms under new equality constraints. Our solution method uses a sum-of-rational method based on absolute continuity of measures. The Moment-SOS truncations of our program possess lower computational complexity and (empirically demonstrated) higher accuracy of upper bounds on example systems as compared to prior approaches.
comment: 9 pages, 2 figures, 4 tables
♻ ☆ Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling AISTATS
Motivated by applications in large-scale and multi-agent reinforcement learning, we study the non-asymptotic performance of stochastic approximation (SA) schemes with delayed updates under Markovian sampling. While the effect of delays has been extensively studied for optimization, the manner in which they interact with the underlying Markov process to shape the finite-time performance of SA remains poorly understood. In this context, our first main contribution is to show that under time-varying bounded delays, the delayed SA update rule guarantees exponentially fast convergence of the \emph{last iterate} to a ball around the SA operator's fixed point. Notably, our bound is \emph{tight} in its dependence on both the maximum delay $\tau_{max}$, and the mixing time $\tau_{mix}$. To achieve this tight bound, we develop a novel inductive proof technique that, unlike various existing delayed-optimization analyses, relies on establishing uniform boundedness of the iterates. As such, our proof may be of independent interest. Next, to mitigate the impact of the maximum delay on the convergence rate, we provide the first finite-time analysis of a delay-adaptive SA scheme under Markovian sampling. In particular, we show that the exponent of convergence of this scheme gets scaled down by $\tau_{avg}$, as opposed to $\tau_{max}$ for the vanilla delayed SA rule; here, $\tau_{avg}$ denotes the average delay across all iterations. Moreover, the adaptive scheme requires no prior knowledge of the delay sequence for step-size tuning. Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms, including TD learning, Q-learning, and stochastic gradient descent under Markovian sampling.
comment: Accepted to the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024!
♻ ☆ Fault Localization for Buggy Deep Learning Framework Conversions in Image Recognition
When deploying Deep Neural Networks (DNNs), developers often convert models from one deep learning framework to another (e.g., TensorFlow to PyTorch). However, this process is error-prone and can impact target model accuracy. To identify the extent of such impact, we perform and briefly present a differential analysis against three DNNs widely used for image recognition (MobileNetV2, ResNet101, and InceptionV3) converted across four well-known deep learning frameworks (PyTorch, Keras, TensorFlow (TF), and TFLite), which revealed numerous model crashes and output label discrepancies of up to 100%. To mitigate such errors, we present a novel approach towards fault localization and repair of buggy deep learning framework conversions, focusing on pre-trained image recognition models. Our technique consists of four stages of analysis: 1) conversion tools, 2) model parameters, 3) model hyperparameters, and 4) graph representation. In addition, we propose various strategies towards fault repair of the faults detected. We implement our technique on top of the Apache TVM deep learning compiler, and we test it by conducting a preliminary fault localization analysis for the conversion of InceptionV3 from TF to TFLite. Our approach detected a fault in a common DNN converter tool, which introduced precision errors in weights, reducing model accuracy. After our fault localization, we repaired the issue, reducing our conversion error to zero.
comment: 5 pages, 3 figures, 1 table
♻ ☆ DeltaNN: Assessing the Impact of Computational Environment Parameters on the Performance of Image Recognition Models
Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and TPUs for fast, timely processing. Failure in real-time image recognition tasks can occur due to sub-optimal mapping on hardware accelerators during model deployment, which may lead to timing uncertainty and erroneous behavior. Mapping on hardware accelerators is done using multiple software components like deep learning frameworks, compilers, and device libraries, that we refer to as the computational environment. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment, as the impact of parameters like deep learning frameworks, compiler optimizations, and hardware devices on model performance and correctness is not yet well understood. In this paper we present a differential testing framework, DeltaNN, that allows us to assess the impact of different computational environment parameters on the performance of image recognition models during deployment, post training. DeltaNN generates different implementations of a given image recognition model for variations in environment parameters, namely, deep learning frameworks, compiler optimizations and hardware devices and analyzes differences in model performance as a result. Using DeltaNN, we conduct an empirical study of robustness analysis of three popular image recognition models using the ImageNet dataset. We report the impact in terms of misclassifications and inference time differences across different settings. In total, we observed up to 100% output label differences across deep learning frameworks, and up to 81% unexpected performance degradation in terms of inference time, when applying compiler optimizations.
comment: 11 pages, 10 figures, 2 tables
♻ ☆ Long Solution Times or Low Solution Quality: On Trade-Offs in Choosing a Power Flow Formulation for the Optimal Power Shutoff Problem
The Optimal Power Shutoff (OPS) problem is an optimization problem that makes power line de-energization decisions in order to reduce the risk of igniting a wildfire, while minimizing the load shed of customers. This problem, with DC linear power flow equations, has been used in many studies in recent years. However, using linear approximations for power flow when making decisions on the network topology is known to cause challenges with AC feasibility of the resulting network, as studied in the related contexts of optimal transmission switching or grid restoration planning. This paper explores the accuracy of the DC OPS formulation and the ability to recover an AC-feasible power flow solution after de-energization decisions are made. We also extend the OPS problem to include variants with the AC, Second-Order-Cone, and Network-Flow power flow equations, and compare them to the DC approximation with respect to solution quality and time. The results highlight that the DC approximation overestimates the amount of load that can be served, leading to poor de-energization decisions. The AC and SOC-based formulations are better, but prohibitively slow to solve for even modestly sized networks thus demonstrating the need for new solution methods with better trade-offs between computational time and solution quality.
♻ ☆ Investigating Robustness in Cyber-Physical Systems: Specification-Centric Analysis in the face of System Deviations
The adoption of cyber-physical systems (CPS) is on the rise in complex physical environments, encompassing domains such as autonomous vehicles, the Internet of Things (IoT), and smart cities. A critical attribute of CPS is robustness, denoting its capacity to operate safely despite potential disruptions and uncertainties in the operating environment. This paper proposes a novel specification-based robustness, which characterizes the effectiveness of a controller in meeting a specified system requirement, articulated through Signal Temporal Logic (STL) while accounting for possible deviations in the system. This paper also proposes the robustness falsification problem based on the definition, which involves identifying minor deviations capable of violating the specified requirement. We present an innovative two-layer simulation-based analysis framework designed to identify subtle robustness violations. To assess our methodology, we devise a series of benchmark problems wherein system parameters can be adjusted to emulate various forms of uncertainties and disturbances. Initial evaluations indicate that our falsification approach proficiently identifies robustness violations, providing valuable insights for comparing robustness between conventional and reinforcement learning (RL)-based controllers
comment: 12 pages
♻ ☆ On the Feedback Law in Stochastic Optimal Nonlinear Control
We consider the problem of nonlinear stochastic optimal control. This problem is thought to be fundamentally intractable owing to Bellman's ``curse of dimensionality". We present a result that shows that repeatedly solving an open-loop deterministic problem from the current state with progressively shorter horizons, similar to Model Predictive Control (MPC), results in a feedback policy that is $O(\epsilon^4)$ near to the true global stochastic optimal policy, \nxx{where $\epsilon$ is a perturbation parameter modulating the noise.} We show that the optimal deterministic feedback problem has a perturbation structure in that higher-order terms of the feedback law do not affect lower-order terms, and that this structure is lost in the optimal stochastic feedback problem. Consequently, solving the Stochastic Dynamic Programming problem is highly susceptible to noise, even when tractable, and in practice, the MPC-type feedback law offers superior performance even for stochastic systems.
comment: arXiv admin note: substantial text overlap with arXiv:2002.10505, arXiv:2002.09478
♻ ☆ Carbon Footprint Reduction for Sustainable Data Centers in Real-Time
As machine learning workloads significantly increase energy consumption, sustainable data centers with low carbon emissions are becoming a top priority for governments and corporations worldwide. This requires a paradigm shift in optimizing power consumption in cooling and IT loads, shifting flexible loads based on the availability of renewable energy in the power grid, and leveraging battery storage from the uninterrupted power supply in data centers, using collaborative agents. The complex association between these optimization strategies and their dependencies on variable external factors like weather and the power grid carbon intensity makes this a hard problem. Currently, a real-time controller to optimize all these goals simultaneously in a dynamic real-world setting is lacking. We propose a Data Center Carbon Footprint Reduction (DC-CFR) multi-agent Reinforcement Learning (MARL) framework that optimizes data centers for the multiple objectives of carbon footprint reduction, energy consumption, and energy cost. The results show that the DC-CFR MARL agents effectively resolved the complex interdependencies in optimizing cooling, load shifting, and energy storage in real-time for various locations under real-world dynamic weather and grid carbon intensity conditions. DC-CFR significantly outperformed the industry standard ASHRAE controller with a considerable reduction in carbon emissions (14.5%), energy usage (14.4%), and energy cost (13.7%) when evaluated over one year across multiple geographical regions.
♻ ☆ Identification of Energy Management Configuration Concepts from a Set of Pareto-optimal Solutions
Implementing resource efficient energy management systems in facilities and buildings becomes increasingly important in the transformation to a sustainable society. However, selecting a suitable configuration based on multiple, typically conflicting objectives, such as cost, robustness with respect to uncertainty of grid operation, or renewable energy utilization, is a difficult multi-criteria decision making problem. The recently developed concept identification technique can facilitate a decision maker by sorting configuration options into semantically meaningful groups (concepts). In this process, the partitioning of the objectives and design parameters into different sets (called description spaces) is a very important step. In this study we focus on utilizing the concept identification technique for finding relevant and viable energy management configurations from a very large data set of Pareto-optimal solutions. The data set consists of 20000 realistic Pareto-optimal building energy management configurations generated by a many-objective evolutionary optimization of a high quality Digital Twin energy management simulator. We analyze how the choice of description spaces, i.e., the partitioning of the objectives and parameters, impacts the type of information that can be extracted. We show that the decision maker can introduce constraints and biases into that process to meet expectations and preferences. The iterative approach presented in this work allows for the generation of valuable insights into trade-offs between specific objectives, and constitutes a powerful and flexible tool to support the decision making process when designing large and complex energy management systems.
comment: 18 pages, 8 figures, accepted at Energy Conversion and Management: X
♻ ☆ Hybrid physics-informed metabolic cybergenetics: process rates augmented with machine-learning surrogates informed by flux balance analysis
Metabolic cybergenetics is a promising concept that interfaces gene expression and cellular metabolism with computers for real-time dynamic metabolic control. The focus is on control at the transcriptional level, serving as a means to modulate intracellular metabolic fluxes. Recent strategies in this field have employed constraint-based dynamic models for process optimization, control, and estimation. However, this results in bilevel dynamic optimization problems, which pose considerable numerical and conceptual challenges. In this study, we present an alternative hybrid physics-informed dynamic modeling framework for metabolic cybergenetics, aimed at simplifying optimization, control, and estimation tasks. By utilizing machine-learning surrogates, our approach effectively embeds the physics of metabolic networks into the process rates of structurally simpler macro-kinetic models coupled with gene expression. These surrogates, informed by flux balance analysis, link the domains of manipulatable intracellular enzymes to metabolic exchange fluxes. This ensures that critical knowledge captured by the system's metabolic network is preserved. The resulting models can be integrated into metabolic cybergenetic schemes involving single-level optimizations. Additionally, the hybrid modeling approach maintains the number of system states at a necessary minimum, easing the burden of process monitoring and estimation. Our hybrid physics-informed metabolic cybergenetic framework is demonstrated using a computational case study on the optogenetically-assisted production of itaconate by $\textit{Escherichia coli}$.
comment: 25 pages, 10 figures, journal submission (reviewed/accepted version)
♻ ☆ Robust Integral Consensus Control of Multi-Agent Networks Perturbed by Matched and Unmatched Disturbances: The Case of Directed Graphs
This work presents a new method to design consensus controllers for perturbed double integrator systems whose interconnection is described by a directed graph containing a rooted spanning tree. We propose new robust controllers to solve the consensus and synchronization problems when the systems are under the effects of matched and unmatched disturbances. In both problems, we present simple continuous controllers, whose integral actions allow us to handle the disturbances. A rigorous stability analysis based on Lyapunov's direct method for unperturbed networked systems is presented. To assess the performance of our result, a representative simulation study is presented.
♻ ☆ Joint Ranging and Phase Offset Estimation for Multiple Drones using ADS-B Signatures
A new method for joint ranging and Phase Offset (PO) estimation of multiple drones/aircrafts is proposed in this paper. The proposed method employs the superimposed uncoordinated Automatic Dependent Surveillance Broadcast (ADS-B) packets broadcasted by drones/aircrafts for joint range and PO estimation. It jointly estimates range and PO prior to ADS-B packet decoding; thus, it can improve air safety when packet decoding is infeasible due to packet collision. Moreover, it enables coherent detection of ADS-B packets, which can result in more reliable multiple target tracking in aviation systems using cooperative sensors for detect and avoid (DAA). By minimizing the Kullback Leibler Divergence (KLD) statistical distance measure, we show that the received complex baseband signal coming from K uncoordinated drones corrupted by Additive White Gaussian Noise (AWGN) at a single antenna receiver can be approximated by an independent and identically distributed Gaussian Mixture (GM) with 2 power K mixture components in the two dimensional (2D) plane. While direct joint Maximum Likelihood Estimation (MLE) of range and PO from the derived GM Probability Density Function (PDF) leads to an intractable maximization, our proposed method employs the Expectation Maximization (EM) algorithm to estimate the modes of the 2D Gaussian mixture followed by a reordering estimation technique through combinatorial optimization to estimate range and PO. An extension to a multiple antenna receiver is also investigated in this paper. While the proposed estimator can estimate the range of multiple drones with a single receive antenna, a larger number of drones can be supported with higher accuracy by the use of multiple antennas at the receiver. The effectiveness of the proposed estimator is supported by simulation results. We show that the proposed estimator can jointly estimate the range of three drones accurately.
♻ ☆ PowerSimulationsDynamics.jl -- An Open Source Modeling Package for Modern Power Systems with Inverter-Based Resources
In this paper we present the development of an open-source simulation toolbox, PowerSimulationsDynamics.jl, to study the dynamic response of power systems, focusing on the requirements to model systems with high penetrations of Inverter-Based Resources (IBRs). PowerSimulationsDynamics.jl is implemented in Julia and features a rich library of synchronous generator, inverter, and load models. In addition, it allows the study of quasi-static phasors and electromagnetic dq models that use a dynamic network representation. Case studies and validation exercises show that PowerSimulationsDynamics.jl results closely match other commercial and open-source simulation tools.
♻ ☆ Secure Control of Connected and Automated Vehicles Using Trust-Aware Robust Event-Triggered Control Barrier Functions
We address the security of a network of Connected and Automated Vehicles (CAVs) cooperating to safely navigate through a conflict area (e.g., traffic intersections, merging roadways, roundabouts). Previous studies have shown that such a network can be targeted by adversarial attacks causing traffic jams or safety violations ending in collisions. We focus on attacks targeting the V2X communication network used to share vehicle data and consider as well uncertainties due to noise in sensor measurements and communication channels. To combat these, motivated by recent work on the safe control of CAVs, we propose a trust-aware robust event-triggered decentralized control and coordination framework that can provably guarantee safety. We maintain a trust metric for each vehicle in the network computed based on their behavior and used to balance the tradeoff between conservativeness (when deeming every vehicle as untrustworthy) and guaranteed safety and security. It is important to highlight that our framework is invariant to the specific choice of the trust framework. Based on this framework, we propose an attack detection and mitigation scheme which has twofold benefits: (i) the trust framework is immune to false positives, and (ii) it provably guarantees safety against false positive cases. We use extensive simulations (in SUMO and CARLA) to validate the theoretical guarantees and demonstrate the efficacy of our proposed scheme to detect and mitigate adversarial attacks.
comment: arXiv admin note: substantial text overlap with arXiv:2305.16818
♻ ☆ Thermally-Resilient Soft Gripper for On-Orbit Operations IROS 2024
Research in soft manipulators has significantly enhanced object grasping capabilities, thanks to their adaptability to various shapes and sizes. Applying this technology to on-orbit servicing, especially during the capture and containment stages of active space debris removal missions, might offer a secure, adaptable, and cost-effective solution compared to the trend of increasing the degrees of freedom and complexity of the manipulator (e.g. ClearSpace, Astroscale). This work aims to conduct an experimental proof of concept, for which challenges such as radiation, vacuum, and microgravity are significant, but the predominant issue is ensuring effective operation in the extreme temperature swings, where flexible materials may exhibit cryogenic crystallization or drastic shifts in their elasticity. This work addresses this challenge through an initial stage of analytical modeling of the thermal dynamics inside the manipulator in orbit; which is then used for the development of a first experimental prototype tested with liquid nitrogen and heat guns. The multi-layered design for Low Earth Orbit (LEO) leverages the properties of TPU at low infill rates for lightweight inherent flexibility, silicone rubber ensuring structural integrity, PTFE (Teflon) for unparalleled thermal stability, and aerogel for insulation. The tendon-actuated servo-driven gripper is tested in the laboratory by varying the shape and size of objects during the grasping. The results, based on servomotor force metrics to assess the flexible manipulator's adaptability and object capture efficiency across temperature changes, affirm the concept's viability. Forces increase up to 220$\%$ in cryogenic conditions and decrease by no more than 50$\%$ at high temperatures.
comment: Submitted to IROS 2024
♻ ☆ Directionality-Aware Mixture Model Parallel Sampling for Efficient Linear Parameter Varying Dynamical System Learning
The Linear Parameter Varying Dynamical System (LPV-DS) is an effective approach that learns stable, time-invariant motion policies using statistical modeling and semi-definite optimization to encode complex motions for reactive robot control. Despite its strengths, the LPV-DS learning approach faces challenges in achieving a high model accuracy without compromising the computational efficiency. To address this, we introduce the Directionality-Aware Mixture Model (DAMM), a novel statistical model that applies the Riemannian metric on the n-sphere $\mathbb{S}^n$ to efficiently blend non-Euclidean directional data with $\mathbb{R}^m$ Euclidean states. Additionally, we develop a hybrid Markov chain Monte Carlo technique that combines Gibbs Sampling with Split/Merge Proposal, allowing for parallel computation to drastically speed up inference. Our extensive empirical tests demonstrate that LPV-DS integrated with DAMM achieves higher reproduction accuracy, better model efficiency, and near real-time/online learning compared to standard estimation methods on various datasets. Lastly, we demonstrate its suitability for incrementally learning multi-behavior policies in real-world robot experiments.
♻ ☆ Stable Reduced-Rank VAR Identification
The vector autoregression (VAR) has been widely used in system identification, econometrics, natural science, and many other areas. However, when the state dimension becomes large the parameter dimension explodes. So rank reduced modelling is attractive and is well developed. But a fundamental requirement in almost all applications is stability of the fitted model. And this has not been addressed in the rank reduced case. Here, we develop, for the first time, a closed-form formula for an estimator of a rank reduced transition matrix which is guaranteed to be stable. We show that our estimator is consistent and asymptotically statistically efficient and illustrate it in comparative simulations.
comment: 16 pages, 6 figures
Multiagent Systems
☆ Symbolic and User-friendly Geometric Algebra Routines (SUGAR) for Computations in Matlab
Geometric algebra (GA) is a mathematical tool for geometric computing, providing a framework that allows a unified and compact approach to geometric relations which in other mathematical systems are typically described using different more complicated elements. This fact has led to an increasing adoption of GA in applied mathematics and engineering problems. However, the scarcity of symbolic implementations of GA and its inherent complexity, requiring a specific mathematical background, make it challenging and less intuitive for engineers to work with. This prevents wider adoption among more applied professionals. To address this challenge, this paper introduces SUGAR (Symbolic and User-friendly Geometric Algebra Routines), an open-source toolbox designed for Matlab and licensed under the MIT License. SUGAR facilitates the translation of GA concepts into Matlab and provides a collection of user-friendly functions tailored for GA computations, including support for symbolic operations. It supports both numeric and symbolic computations in high-dimensional GAs. Specifically tailored for applied mathematics and engineering applications, SUGAR has been meticulously engineered to represent geometric elements and transformations within two and three-dimensional projective and conformal geometric algebras, aligning with established computational methodologies in the literature. Furthermore, SUGAR efficiently handles functions of multivectors, such as exponential, logarithmic, sinusoidal, and cosine functions, enhancing its applicability across various engineering domains, including robotics, control systems, and power electronics. Finally, this work includes four distinct validation examples, demonstrating SUGAR's capabilities across the above-mentioned fields and its practical utility in addressing real-world applied mathematics and engineering problems.
comment: 33 pages, 6 figures, journal paper submitted to ACM TOMS
Databases
☆ Chase Termination Beyond Polynomial Time
The chase is a widely implemented approach to reason with tuple-generating dependencies (tgds), used in data exchange, data integration, and ontology-based query answering. However, it is merely a semi-decision procedure, which may fail to terminate. Many decidable conditions have been proposed for tgds to ensure chase termination, typically by forbidding some kind of "cycle" in the chase process. We propose a new criterion that explicitly allows some such cycles, and yet ensures termination of the standard chase under reasonable conditions. This leads to new decidable fragments of tgds that are not only syntactically more general but also strictly more expressive than the fragments defined by prior acyclicity conditions. Indeed, while known terminating fragments are restricted to PTime data complexity, our conditions yield decidable languages for any k-ExpTime. We further refine our syntactic conditions to obtain fragments of tgds for which an optimised chase procedure decides query entailment in PSpace or k-ExpSpace, respectively.
☆ Implementing and Evaluating E2LSH on Storage
Locality sensitive hashing (LSH) is one of the widely-used approaches to approximate nearest neighbor search (ANNS) in high-dimensional spaces. The first work on LSH for the Euclidean distance, E2LSH, showed how ANNS can be solved efficiently at a sublinear query time in the database size with theoretically-guaranteed accuracy, although it required a large hash index size. Since then, several LSH variants having much smaller index sizes have been proposed. Their query time is linear or superlinear, but they have been shown to run effectively faster because they require fewer I/Os when the index is stored on hard disk drives and because they also permit in-memory execution with modern DRAM capacity. In this paper, we show that E2LSH is regaining the advantage in query speed with the advent of modern flash storage devices such as solid-state drives (SSDs). We evaluate E2LSH on a modern single-node computing environment and analyze its computational cost and I/O cost, from which we derive storage performance requirements for its external memory execution. Our analysis indicates that E2LSH on a single consumer-grade SSD can run faster than the state-of-the-art small-index methods executed in-memory. It also indicates that E2LSH with emerging high-performance storage devices and interfaces can approach in-memory E2LSH speeds. We implement a simple adaptation of E2LSH to external memory, E2LSH-on-Storage (E2LSHoS), and evaluate it for practical large datasets of up to one billion objects using different combinations of modern storage devices and interfaces. We demonstrate that our E2LSHoS implementation runs much faster than small-index methods and can approach in-memory E2LSH speeds, and also that its query time scales sublinearly with the database size beyond the index size limit of in-memory E2LSH.
☆ Corra: Correlation-Aware Column Compression
Column encoding schemes have witnessed a spark of interest lately. This is not surprising -- as data volume increases, being able to keep one's dataset in main memory for fast processing is a coveted desideratum. However, it also seems that single-column encoding schemes have reached a plateau in terms of the compression size they can achieve. We argue that this is because they do not exploit correlations in the data. Consider for instance the column pair ($\texttt{city}$, $\texttt{zip-code}$) of the DMV dataset: a city has only a few dozen unique zip codes. Such information, if properly exploited, can significantly reduce the space consumption of the latter column. In this work, we depart from the established, well-trodden path of compressing data using only single-column encoding schemes and introduce $\textit{correlation-aware}$ encoding schemes. We demonstrate their advantages compared to single-column encoding schemes on the well-known TPC-H's $\texttt{lineitem}$, LDBC's $\texttt{message}$, DMV, and Taxi. For example, we obtain a saving rate of 58.3% for $\texttt{lineitem}$'s $\texttt{shipdate}$, while the dropoff timestamps in Taxi witness a saving rate of 30.6%.
☆ GGDMiner - Discovery of Graph Generating Dependencies for Graph Data Profiling
With the increasing use of graph-structured data, there is also increasing interest in investigating graph data dependencies and their applications, e.g., in graph data profiling. Graph Generating Dependencies (GGDs) are a class of dependencies for property graphs that can express the relation between different graph patterns and constraints based on their attribute similarities. Rich syntax and semantics of GGDs make them a good candidate for graph data profiling. Nonetheless, GGDs are difficult to define manually, especially when there are no data experts available. In this paper, we propose GGDMiner, a framework for discovering approximate GGDs from graph data automatically, with the intention of profiling graph data through GGDs for the user. GGDMiner has three main steps: (1) pre-processing, (2) candidate generation, and, (3) GGD extraction. To optimize memory consumption and execution time, GGDMiner uses a factorized representation of each discovered graph pattern, called Answer Graph. Our results show that the discovered set of GGDs can give an overview about the input graph, both schema level information and also correlations between the graph patterns and attributes.
♻ ☆ ProvDeploy: Provenance-oriented Containerization of High Performance Computing Scientific Workflows
Many existing scientific workflows require High Performance Computing environments to produce results in a timely manner. These workflows have several software library components and use different environments, making the deployment and execution of the software stack not trivial. This complexity increases if the user needs to add provenance data capture services to the workflow. This manuscript introduces ProvDeploy to assist the user in configuring containers for scientific workflows with integrated provenance data capture. ProvDeploy was evaluated with a Scientific Machine Learning workflow, exploring containerization strategies focused on provenance in two distinct HPC environments
♻ ☆ When is Shapley Value Computation a Matter of Counting? PODS'24
The Shapley value provides a natural means of quantifying the contributions of facts to database query answers. In this work, we seek to broaden our understanding of Shapley value computation (SVC) in the database setting by revealing how it relates to Fixed-size Generalized Model Counting (FGMC), which is the problem of computing the number of sub-databases of a given size and containing a given set of assumed facts that satisfy a fixed query. Our focus will be on explaining the difficulty of SVC via FGMC, and to this end, we identify general conditions on queries which enable reductions from FGMC to SVC. As a byproduct, we not only obtain alternative explanations for most existing results on SVC, but also new complexity results. In particular, we establish FP-#P complexity dichotomies for constant-free connected UCQs and homomorphism-closed connected graph queries. We further explore variants of SVC, either in the absence of assumed facts, or where we measure the contribution of constants rather than facts.
comment: Published at PODS'24
Emerging Technologies
☆ Partially-Precise Computing Paradigm for Efficient Hardware Implementation of Application-Specific Embedded Systems
Nowadays, the number of emerging embedded systems rapidly grows in many application domains, due to recent advances in artificial intelligence and internet of things. The main inherent specification of these application-specific systems is that they have not a general nature and are basically developed to only perform a particular task and therefore, deal only with a limited and predefined range of custom input values. Despite this significant feature, these emerging applications are still conventionally implemented using general-purpose and precise digital computational blocks, which are essentially developed to provide the correct result for all possible input values. This highly degrades the physical properties of these applications while does not improve their functionality. To resolve this conflict, a novel computational paradigm named as partially-precise computing is introduced in this paper, based on an inspiration from the brain information reduction hypothesis as a tenet of neuroscience. The main specification of a Partially-Precise Computational (PPC) block is that it provides the precise result only for a desired, limited, and predefined set of input values. This relaxes its internal structure which results in improved physical properties with respect to a conventional precise block. The PPC blocks improve the implementation costs of the embedded applications, with a negligible or even without any output quality degradation with respect to the conventional implementation. The applicability and efficiency of the first instances of PPC adders and multipliers in a Gaussian denoising filter, an image blending and a face recognition neural network are demonstrated by means of a wide range of simulation and synthesis results.
comment: main article is 12 pages and supplementary notes is 6 pages
☆ Molecular Communication-Based Intelligent Dopamine Rate Modulator for Parkinson's Disease Treatment
Parkinson's disease (PD) is a progressive neurodegenerative disease, and it is caused by the loss of dopaminergic neurons in the basal ganglia (BG). Currently, there is no definite cure for PD, and available treatments mainly aim to alleviate its symptoms. Due to impaired neurotransmitter-based information transmission in PD, molecular communication-based approaches can be employed as potential solutions to address this issue. Molecular Communications (MC) is a bio-inspired communication method utilizing molecules for carrying information. This mode of communication stands out for developing bio-compatible nanomachines for diagnosing and treating, particularly in addressing neurodegenerative diseases like PD, due to its compatibility with biological systems. This study presents a novel treatment method that introduces an Intelligent Dopamine Rate Modulator (IDRM), which is located in the synaptic gap between the substantia nigra pars compacta (SNc) and striatum to compensate for insufficiency dopamine release in BG caused by PD. For storing dopamine in the IDRM, dopamine compound (DAC) is swallowed and crossed through the digestive system, blood circulatory system, blood-brain barrier (BBB), and brain extracellular matrix uptakes with IDRMs. Here, the DAC concentration is calculated in these regions, revealing that the required exogenous dopamine consistently reaches IDRM. Therefore, the perpetual dopamine insufficiency in BG associated with PD can be compensated. This method reduces drug side effects because dopamine is not released in other brain regions. Unlike other treatments, this approach targets the root cause of PD rather than just reducing symptoms.
comment: 10 pages, 9 figures
☆ Electron-Tunnelling-Noise Programmable Random Variate Accelerator for Monte Carlo Sampling
This article presents an electron tunneling noise programmable random variate accelerator for accelerating the sampling stage of Monte Carlo simulations. We used the LiteX framework to generate a Petitbateau FemtoRV RISC-V instruction set soft processor and deploy it on a Digilent Arty-100T FPGA development board. The RISC-V soft processor augmented with our programmable random variate accelerator achieves an average speedup of 8.70 times and a median speedup of 8.68 times for a suite of twelve different benchmark applications when compared to GNU Scientific Library software random number generation. These speedups are achievable because the benchmarks spend an average of 90.0 % of their execution time generating random samples. The results of the Monte Carlo benchmark programs run over the programmable random variate accelerator have an average Wasserstein distance of 1.48 times and a median Wasserstein distance of 1.41 times$that of the results produced by the GNU Scientific Library random number generators. The soft processor samples the electron tunneling noise source using the hardened XADC block in the FPGA. The flexibility of the LiteX framework allows for the deployment of any LiteX-supported soft processor with an electron tunneling noise programmable random variate accelerator on any LiteX-supported development board that contains an FPGA with an XADC.
Human-Computer Interaction
☆ "It Is Easy Using My Apps:" Understanding Technology Use and Needs of Adults with Down Syndrome CHI 2024
Assistive technologies for adults with Down syndrome (DS) need designs tailored to their specific technology requirements. While prior research has explored technology design for individuals with intellectual disabilities, little is understood about the needs and expectations of adults with DS. Assistive technologies should leverage the abilities and interests of the population, while incorporating age- and context-considerate content. In this work, we interviewed six adults with DS, seven parents of adults with DS, and three experts in speech-language pathology, special education, and occupational therapy to determine how technology could support adults with DS. In our thematic analysis, four main themes emerged, including (1) community vs. home social involvement; (2) misalignment of skill expectations between adults with DS and parents; (3) family limitations in technology support; and (4) considerations for technology development. Our findings extend prior literature by including the voices of adults with DS in how and when they use technology.
comment: 17 pages, 4 figures, to be published in ACM CHI 2024
☆ "How do people decide?": A Model for Software Library Selection
Modern-day software development is often facilitated by the reuse of third-party software libraries. Despite the significant effort to understand the factors contributing to library selection, it is relatively unknown how the libraries are selected and what tools are still needed to support the selection process. Using Straussian grounded theory, we conducted and analyzed the interviews of 24 professionals across the world and derived a model of library selection process which is governed by six selection patterns (i.e., rules). The model draws from marketing theory and lays the groundwork for the development of a library selection tool which captures the technical and non-technical aspects developers consider.
☆ SQL-Encoder: Improving NL2SQL In-Context Learning Through a Context-Aware Encoder
Detecting structural similarity between queries is essential for selecting examples in in-context learning models. However, assessing structural similarity based solely on the natural language expressions of queries, without considering SQL queries, presents a significant challenge. This paper explores the significance of this similarity metric and proposes a model for accurately estimating it. To achieve this, we leverage a dataset comprising 170k question pairs, meticulously curated to train a similarity prediction model. Our comprehensive evaluation demonstrates that the proposed model adeptly captures the structural similarity between questions, as evidenced by improvements in Kendall-Tau distance and precision@k metrics. Notably, our model outperforms strong competitive embedding models from OpenAI and Cohere. Furthermore, compared to these competitive models, our proposed encoder enhances the downstream performance of NL2SQL models in 1-shot in-context learning scenarios by 1-2\% for GPT-3.5-turbo, 4-8\% for CodeLlama-7B, and 2-3\% for CodeLlama-13B.
☆ Designing Child-Centric AI Learning Environments: Insights from LLM-Enhanced Creative Project-Based Learning
Project-based learning (PBL) is an instructional method that is very helpful in nurturing students' creativity, but it requires significant time and energy from both students and teachers. Large language models (LLMs) have been proven to assist in creative tasks, yet much controversy exists regarding their role in fostering creativity. This paper explores the potential of LLMs in PBL settings, with a special focus on fostering creativity. We began with an exploratory study involving 12 middle school students and identified five design considerations for LLM applications in PBL. Building on this, we developed an LLM-empowered, 48-hour PBL program and conducted an instructional experiment with 31 middle school students. Our results indicated that LLMs can enhance every stage of PBL. Additionally, we also discovered ambivalent perspectives among students and mentors toward LLM usage. Furthermore, we explored the challenge and design implications of integrating LLMs into PBL and reflected on the program. By bridging AI advancements into educational practice, our work aims to inspire further discourse and investigation into harnessing AI's potential in child-centric educational settings.
☆ Designing Upper-Body Gesture Interaction with and for People with Spinal Muscular Atrophy in VR CHI
Recent research proposed gaze-assisted gestures to enhance interaction within virtual reality (VR), providing opportunities for people with motor impairments to experience VR. Compared to people with other motor impairments, those with Spinal Muscular Atrophy (SMA) exhibit enhanced distal limb mobility, providing them with more design space. However, it remains unknown what gaze-assisted upper-body gestures people with SMA would want and be able to perform. We conducted an elicitation study in which 12 VR-experienced people with SMA designed upper-body gestures for 26 VR commands, and collected 312 user-defined gestures. Participants predominantly favored creating gestures with their hands. The type of tasks and participants' abilities influence their choice of body parts for gesture design. Participants tended to enhance their body involvement and preferred gestures that required minimal physical effort, and were aesthetically pleasing. Our research will contribute to creating better gesture-based input methods for people with motor impairments to interact with VR.
comment: Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA
☆ Understanding the Impact of Referent Design on Scale Perception in Immersive Data Visualization CHI
Referents are often used to enhance scale perception in immersive visualizations. Common referent designs include the considerations of referent layout (side-by-side vs. in-situ) and referent size (small vs. medium vs. large). This paper introduces a controlled user study to assess how different referent designs affect the efficiency and accuracy of scale perception across different data scales, on the performance of the size-matching task in the virtual environment. Our results reveal that in-situ layouts significantly enhance accuracy and confidence across various data scales, particularly with large referents. Linear regression analyses further confirm that in-situ layouts exhibit greater resilience to changes in data scale. For tasks requiring efficiency, medium-sized referents emerge as the preferred choice. Based on these findings, we offer design guidelines for selecting referent layouts and sizes in immersive visualizations.
comment: 7 pages, 6 figures, Accepted to Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24)
☆ Persuasion or Insulting? Unpacking Discursive Strategies of Gender Debate in Everyday Feminism in China CHI
Speaking out for women's daily needs on social media has become a crucial form of everyday feminism in China. Gender debate naturally intertwines with such feminist advocacy, where users in opposite stances discuss gender-related issues through intense discourse. The complexities of gender debate necessitate a systematic understanding of discursive strategies for achieving effective gender communication that balances civility and constructiveness. To address this problem, we adopted a mixed-methods study to navigate discursive strategies in gender debate, focusing on 38,636 posts and 187,539 comments from two representative cases in China. Through open coding, we identified a comprehensive taxonomy of linguistic strategies in gender debate, capturing five overarching themes including derogation, gender distinction, intensification, mitigation, and cognizance guidance. Further, we applied regression analysis to unveil these strategies' correlations with user participation and response, illustrating the tension between debating tactics and public engagement. We discuss design implications to facilitate feminist advocacy on social media.
comment: 19 pages, 3 figures, In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24)
♻ ☆ Development and Evaluation of a Learning-based Model for Real-time Haptic Texture Rendering
Current Virtual Reality (VR) environments lack the rich haptic signals that humans experience during real-life interactions, such as the sensation of texture during lateral movement on a surface. Adding realistic haptic textures to VR environments requires a model that generalizes to variations of a user's interaction and to the wide variety of existing textures in the world. Current methodologies for haptic texture rendering exist, but they usually develop one model per texture, resulting in low scalability. We present a deep learning-based action-conditional model for haptic texture rendering and evaluate its perceptual performance in rendering realistic texture vibrations through a multi part human user study. This model is unified over all materials and uses data from a vision-based tactile sensor (GelSight) to render the appropriate surface conditioned on the user's action in real time. For rendering texture, we use a high-bandwidth vibrotactile transducer attached to a 3D Systems Touch device. The result of our user study shows that our learning-based method creates high-frequency texture renderings with comparable or better quality than state-of-the-art methods without the need for learning a separate model per texture. Furthermore, we show that the method is capable of rendering previously unseen textures using a single GelSight image of their surface.
comment: Accepted for publication in IEEE Transactions on Haptics 2024. 12 pages, 8 figures
♻ ☆ Assessing cognitive function among older adults using machine learning and wearable device data: a feasibility study
Timely implementation of interventions to slow cognitive decline among older adults requires accurate monitoring to detect changes in cognitive function. Data gathered using wearable devices that can continuously monitor factors known to be associated with cognition could be used to train machine learning models and develop wearable-based cognitive monitoring systems. Using data from over 2,400 older adults in the National Health and Nutrition Examination Survey (NHANES) we developed prediction models to differentiate older adults with normal cognition from those with poor cognition based on outcomes from three cognitive tests measuring different domains of cognitive function. During repeated cross-validation, CatBoost, XGBoost, and Random Forest models performed best when predicting cognition based on processing speed, working memory, and attention (median AUCs >0.82) compared to immediate and delayed recall (median AUCs >0.72) and categorical verbal fluency (median AUC >0.68). Activity and sleep parameters were also more strongly associated with processing speed, working memory, and attention compared to other cognitive subdomains. Our work provides proof of concept that wearable-based cognitive monitoring systems may be a viable alternative to traditional methods for monitoring processing speeds, working memory, and attention. We further identified novel metrics that could be targets in future causal studies seeking to better understand how sleep and activity parameters influence cognitive function among older adults.
♻ ☆ Trickery: Educational Dark Pattern Analogies for Use in Serious Games CHI
Dark patterns are often used in interface design to manipulate users into performing actions they would otherwise not take, such as consenting to excessive data collection. We present a narrative serious game concept, along with seven educational dark pattern analogies designed to create awareness of and bolster resistance against dark patterns through direct consequences of player actions. We performed a qualitative laboratory gameplay study investigating player behavior when confronted with educational dark pattern analogies in a serious game and an online survey study evaluating the perceived helpfulness of our educational dark pattern analogies. Our results provide insights into influencing factors for adapting dark patterns into gameplay, as well as player motivations and driving forces influencing player behavior, and show educational dark patterns to be a promising solution to increase user understanding of dark pattern concepts.
comment: [V2]: Submitted to CHI PLAY 2024 [V1]: Submitted to Proceedings on Privacy Enhancing Technologies 2024
♻ ☆ REVERSIM: A Game-Based Environment to Study Human Aspects in Hardware Reverse Engineering
Hardware Reverse Engineering (HRE) is a technique for analyzing Integrated Circuits (ICs). Experts employ HRE for security-critical tasks, such as detecting Trojans or intellectual property violations. They rely not only on their experience and customized tools but also on their cognitive abilities. Conducting controlled experiments to assess the cognitive processes involved in HRE can open new avenues for hardware protection. However, HRE experts are largely unavailable for empirical research in real-world settings. To address this challenge, we have developed REVERSIM, a game-based environment that mimics realistic HRE subprocesses and can integrate standardized cognitive tests. REVERSIM enables quantitative studies with easier-to-recruit non-experts to uncover cognitive factors relevant to HRE, which can subsequently be validated with small expert samples. To evaluate the design of REVERSIM, the minimum requirements for successful participation, and its measurement capabilities, we conducted two studies: First, we performed semi-structured interviews with 14 professionals and researchers from the HRE domain, who attested to the comparability of REVERSIM to real-world HRE problems. Second, we conducted an online user study with 109 participants, demonstrating that they could engage in REVERSIM with low domain-specific prior knowledge. We provide refined screening criteria, derive fine-grained performance metrics, and successfully perform a cognitive test for mental speed in REVERSIM, thus contributing an important piece of the puzzle for the development of innovative hardware protection mechanisms.
♻ ☆ Towards Designing a Question-Answering Chatbot for Online News: Understanding Questions and Perspectives
Large Language Models (LLMs) have created opportunities for designing chatbots that can support complex question-answering (QA) scenarios and improve news audience engagement. However, we still lack an understanding of what roles journalists and readers deem fit for such a chatbot in newsrooms. To address this gap, we first interviewed six journalists to understand how they answer questions from readers currently and how they want to use a QA chatbot for this purpose. To understand how readers want to interact with a QA chatbot, we then conducted an online experiment (N=124) where we asked each participant to read three news articles and ask questions to either the author(s) of the articles or a chatbot. By combining results from the studies, we present alignments and discrepancies between how journalists and readers want to use QA chatbots and propose a framework for designing effective QA chatbots in newsrooms.
♻ ☆ The HaLLMark Effect: Supporting Provenance and Transparent Use of Large Language Models in Writing with Interactive Visualization
The use of Large Language Models (LLMs) for writing has sparked controversy both among readers and writers. On one hand, writers are concerned that LLMs will deprive them of agency and ownership, and readers are concerned about spending their time on text generated by soulless machines. On the other hand, AI-assistance can improve writing as long as writers can conform to publisher policies, and as long as readers can be assured that a text has been verified by a human. We argue that a system that captures the provenance of interaction with an LLM can help writers retain their agency, conform to policies, and communicate their use of AI to publishers and readers transparently. Thus we propose HaLLMark, a tool for visualizing the writer's interaction with the LLM. We evaluated HaLLMark with 13 creative writers, and found that it helped them retain a sense of control and ownership of the text.
♻ ☆ Form-From: A Design Space of Social Media Systems
Social media systems are as varied as they are pervasive. They have been almost universally adopted for a broad range of purposes including work, entertainment, activism, and decision making. As a result, they have also diversified, with many distinct designs differing in content type, organization, delivery mechanism, access control, and many other dimensions. In this work, we aim to characterize and then distill a concise design space of social media systems that can help us understand similarities and differences, recognize potential consequences of design choices, and identify spaces for innovation. Our model, which we call Form-From, characterizes social media based on (1) the form of the content, either threaded or flat, and (2) from where or from whom one might receive content, ranging from spaces to networks to the commons. We derive Form-From inductively from a larger set of 62 dimensions organized into 10 categories. To demonstrate the utility of our model, we trace the history of social media systems as they traverse the Form-From space over time, and we identify common design patterns within cells of the model.
Social and Information Networks
☆ Large Language Models in Biomedical and Health Informatics: A Bibliometric Review
Large Language Models (LLMs) have rapidly become important tools in Biomedical and Health Informatics (BHI), enabling new ways to analyze data, treat patients, and conduct research. This bibliometric review aims to provide a panoramic view of how LLMs have been used in BHI by examining research articles and collaboration networks from 2022 to 2023. It further explores how LLMs can improve Natural Language Processing (NLP) applications in various BHI areas like medical diagnosis, patient engagement, electronic health record management, and personalized medicine. To do this, our bibliometric review identifies key trends, maps out research networks, and highlights major developments in this fast-moving field. Lastly, it discusses the ethical concerns and practical challenges of using LLMs in BHI, such as data privacy and reliable medical recommendations. Looking ahead, we consider how LLMs could further transform biomedical research as well as healthcare delivery and patient outcomes. This comprehensive review serves as a resource for stakeholders in healthcare, including researchers, clinicians, and policymakers, to understand the current state and future potential of LLMs in BHI.
comment: 50 pages, 7 figures, 4 tables
☆ A Survey on Self-Supervised Pre-Training of Graph Foundation Models: A Knowledge-Based Perspective
Graph self-supervised learning is now a go-to method for pre-training graph foundation models, including graph neural networks, graph transformers, and more recent large language model (LLM)-based graph models. There is a wide variety of knowledge patterns embedded in the structure and properties of graphs which may be used for pre-training, but we lack a systematic overview of self-supervised pre-training tasks from the perspective of graph knowledge. In this paper, we comprehensively survey and analyze the pre-training tasks of graph foundation models from a knowledge-based perspective, consisting of microscopic (nodes, links, etc) and macroscopic knowledge (clusters, global structure, etc). It covers a total of 9 knowledge categories and 25 pre-training tasks, as well as various downstream task adaptation strategies. Furthermore, an extensive list of the related papers with detailed metadata is provided at https://github.com/Newiz430/Pretext.
comment: Work in progress
☆ From Clicks to Conversions: Analysis of Traffic Sources in E-Commerce
Over the past years, e-commerce platforms have expanded substantially, providing customers with convenient shopping experiences. To enhance e-commerce websites, it's essential to grasp user engagement and factors affecting conversion rates. This optimization is achieved by aligning platforms with user expectations, thereby fostering successful online shopping experiences, and contributing to sustained growth in the dynamic digital marketplace. In this paper, we conduct a comprehensive analysis focusing on user interactions, conversion metrics, and the entire user journey within an e-commerce platform. Our exploration spans exit rates and sessions across different devices and browsers, conversion rates for various traffic sources, and the user journey from product details to checkout. Findings suggest a need for targeted improvements in mobile optimization and browser compatibility, indicated by higher exit rates on mobile devices and varying rates on different browsers. The conversion rate analysis emphasizes the varying effectiveness of traffic sources, highlighting the potential of advertised mediums, while identifying areas for improvement in referral and affiliate traffic. Examining the user journey showed potential bottlenecks in the conversion process, the identified gap was between user interest and completed transactions. Our study suggests improving the checkout process strategically to streamline user transactions. These insights provide actionable guidance for businesses seeking to refine their platforms and optimize performance in the ever-evolving landscape of e-commerce.
comment: 12th INTERNATIONAL CONFERENCE ON CONTEMPORARY ISSUES IN MANAGEMENT
☆ Node Classification via Semantic-Structural Attention-Enhanced Graph Convolutional Networks
Graph data, also known as complex network data, is omnipresent across various domains and applications. Prior graph neural network models primarily focused on extracting task-specific structural features through supervised learning objectives, but they fell short in capturing the inherent semantic and structural features of the entire graph. In this paper, we introduce the semantic-structural attention-enhanced graph convolutional network (SSA-GCN), which not only models the graph structure but also extracts generalized unsupervised features to enhance vertex classification performance. The SSA-GCN's key contributions lie in three aspects: firstly, it derives semantic information through unsupervised feature extraction from a knowledge graph perspective; secondly, it obtains structural information through unsupervised feature extraction from a complex network perspective; and finally, it integrates these features through a cross-attention mechanism. By leveraging these features, we augment the graph convolutional network, thereby enhancing the model's generalization capabilities. Our experiments on the Cora and CiteSeer datasets demonstrate the performance improvements achieved by our proposed method. Furthermore, our approach also exhibits excellent accuracy under privacy settings, making it a robust and effective solution for graph data analysis.
♻ ☆ Inductive Knowledge Graph Completion with GNNs and Rules: An Analysis
The task of inductive knowledge graph completion requires models to learn inference patterns from a training graph, which can then be used to make predictions on a disjoint test graph. Rule-based methods seem like a natural fit for this task, but in practice they significantly underperform state-of-the-art methods based on Graph Neural Networks (GNNs), such as NBFNet. We hypothesise that the underperformance of rule-based methods is due to two factors: (i) implausible entities are not ranked at all and (ii) only the most informative path is taken into account when determining the confidence in a given link prediction answer. To analyse the impact of these factors, we study a number of variants of a rule-based approach, which are specifically aimed at addressing the aforementioned issues. We find that the resulting models can achieve a performance which is close to that of NBFNet. Crucially, the considered variants only use a small fraction of the evidence that NBFNet relies on, which means that they largely keep the interpretability advantage of rule-based methods. Moreover, we show that a further variant, which does look at the full KG, consistently outperforms NBFNet.
♻ ☆ Form-From: A Design Space of Social Media Systems
Social media systems are as varied as they are pervasive. They have been almost universally adopted for a broad range of purposes including work, entertainment, activism, and decision making. As a result, they have also diversified, with many distinct designs differing in content type, organization, delivery mechanism, access control, and many other dimensions. In this work, we aim to characterize and then distill a concise design space of social media systems that can help us understand similarities and differences, recognize potential consequences of design choices, and identify spaces for innovation. Our model, which we call Form-From, characterizes social media based on (1) the form of the content, either threaded or flat, and (2) from where or from whom one might receive content, ranging from spaces to networks to the commons. We derive Form-From inductively from a larger set of 62 dimensions organized into 10 categories. To demonstrate the utility of our model, we trace the history of social media systems as they traverse the Form-From space over time, and we identify common design patterns within cells of the model.
Distributed, Parallel, and Cluster Computing
☆ Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network
on adaptive routing to balance network traffic for optimum performance. Ideally, adaptive routing attempts to forward packets between minimal and non-minimal paths with the least congestion. In practice, current adaptive routing algorithms estimate routing path congestion based on local information such as output queue occupancy. Using local information to estimate global path congestion is inevitably inaccurate because a router has no precise knowledge of link states a few hops away. This inaccuracy could lead to interconnect congestion. In this study, we present Q-adaptive routing, a multi-agent reinforcement learning routing scheme for Dragonfly systems. Q-adaptive routing enables routers to learn to route autonomously by leveraging advanced reinforcement learning technology. The proposed Q-adaptive routing is highly scalable thanks to its fully distributed nature without using any shared information between routers. Furthermore, a new two-level Q-table is designed for Q-adaptive to make it computational lightly and saves 50% of router memory usage compared with the previous Q-routing. We implement the proposed Q-adaptive routing in SST/Merlin simulator. Our evaluation results show that Q-adaptive routing achieves up to 10.5% system throughput improvement and 5.2x average packet latency reduction compared with adaptive routing algorithms. Remarkably, Q-adaptive can even outperform the optimal VALn non-minimal routing under the ADV+1 adversarial traffic pattern with up to 3% system throughput improvement and 75% average packet latency reduction.
☆ MRSch: Multi-Resource Scheduling for HPC
Emerging workloads in high-performance computing (HPC) are embracing significant changes, such as having diverse resource requirements instead of being CPU-centric. This advancement forces cluster schedulers to consider multiple schedulable resources during decision-making. Existing scheduling studies rely on heuristic or optimization methods, which are limited by an inability to adapt to new scenarios for ensuring long-term scheduling performance. We present an intelligent scheduling agent named MRSch for multi-resource scheduling in HPC that leverages direct future prediction (DFP), an advanced multi-objective reinforcement learning algorithm. While DFP demonstrated outstanding performance in a gaming competition, it has not been previously explored in the context of HPC scheduling. Several key techniques are developed in this study to tackle the challenges involved in multi-resource scheduling. These techniques enable MRSch to learn an appropriate scheduling policy automatically and dynamically adapt its policy in response to workload changes via dynamic resource prioritizing. We compare MRSch with existing scheduling methods through extensive tracebase simulations. Our results demonstrate that MRSch improves scheduling performance by up to 48% compared to the existing scheduling methods.
☆ Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling
In the field of high-performance computing (HPC), there has been recent exploration into the use of deep reinforcement learning for cluster scheduling (DRL scheduling), which has demonstrated promising outcomes. However, a significant challenge arises from the lack of interpretability in deep neural networks (DNN), rendering them as black-box models to system managers. This lack of model interpretability hinders the practical deployment of DRL scheduling. In this work, we present a framework called IRL (Interpretable Reinforcement Learning) to address the issue of interpretability of DRL scheduling. The core idea is to interpret DNN (i.e., the DRL policy) as a decision tree by utilizing imitation learning. Unlike DNN, decision tree models are non-parametric and easily comprehensible to humans. To extract an effective and efficient decision tree, IRL incorporates the Dataset Aggregation (DAgger) algorithm and introduces the notion of critical state to prune the derived decision tree. Through trace-based experiments, we demonstrate that IRL is capable of converting a black-box DNN policy into an interpretable rulebased decision tree while maintaining comparable scheduling performance. Additionally, IRL can contribute to the setting of rewards in DRL scheduling.
☆ Study of Workload Interference with Intelligent Routing on Dragonfly
Dragonfly interconnect is a crucial network technology for supercomputers. To support exascale systems, network resources are shared such that links and routers are not dedicated to any node pair. While link utilization is increased, workload performance is often offset by network contention. Recently, intelligent routing built on reinforcement learning demonstrates higher network throughput with lower packet latency. However, its effectiveness in reducing workload interference is unknown. In this work, we present extensive network simulations to study multi-workload contention under different routing mechanisms, intelligent routing and adaptive routing, on a large-scale Dragonfly system. We develop an enhanced network simulation toolkit, along with a suite of workloads with distinctive communication patterns. We also present two metrics to characterize application communication intensity. Our analysis focuses on examining how different workloads interfere with each other under different routing mechanisms by inspecting both application-level and network-level metrics. Several key insights are made from the analysis.
☆ A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters
Joint consideration of scheduling and adaptive parallelism offers great opportunities for improving the training efficiency of large models on heterogeneous GPU clusters. However, integrating adaptive parallelism into a cluster scheduler expands the cluster scheduling space. The new space is the product of the original scheduling space and the parallelism exploration space of adaptive parallelism (also a product of pipeline, data, and tensor parallelism). The exponentially enlarged scheduling space and ever-changing optimal parallelism plan from adaptive parallelism together result in the contradiction between low-overhead and accurate performance data acquisition for efficient cluster scheduling. This paper presents Crius, a training system for efficiently scheduling multiple large models with adaptive parallelism in a heterogeneous cluster. Crius proposes a novel scheduling granularity called Cell. It represents a job with deterministic resources and pipeline stages. The exploration space of Cell is shrunk to the product of only data and tensor parallelism, thus exposing the potential for accurate and low-overhead performance estimation. Crius then accurately estimates Cells and efficiently schedules training jobs. When a Cell is selected as a scheduling choice, its represented job runs with the optimal parallelism plan explored. Experimental results show that Crius reduces job completion time by up to 48.9% and schedules large models with up to 1.49x cluster throughput improvement.
♻ ☆ FedNMUT -- Federated Noisy Model Update Tracking Convergence Analysis
A novel Decentralized Noisy Model Update Tracking Federated Learning algorithm (FedNMUT) is proposed that is tailored to function efficiently in the presence of noisy communication channels that reflect imperfect information exchange. This algorithm uses gradient tracking to minimize the impact of data heterogeneity while minimizing communication overhead. The proposed algorithm incorporates noise into its parameters to mimic the conditions of noisy communication channels, thereby enabling consensus among clients through a communication graph topology in such challenging environments. FedNMUT prioritizes parameter sharing and noise incorporation to increase the resilience of decentralized learning systems against noisy communications. Theoretical results for the smooth non-convex objective function are provided by us, and it is shown that the $\epsilon-$stationary solution is achieved by our algorithm at the rate of $\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)$, where $T$ is the total number of communication rounds. Additionally, via empirical validation, we demonstrated that the performance of FedNMUT is superior to the existing state-of-the-art methods and conventional parameter-mixing approaches in dealing with imperfect information sharing. This proves the capability of the proposed algorithm to counteract the negative effects of communication noise in a decentralized learning framework.
comment: arXiv admin note: text overlap with arXiv:2303.10695
♻ ☆ Gradient-less Federated Gradient Boosting Trees with Learnable Learning Rates
The privacy-sensitive nature of decentralized datasets and the robustness of eXtreme Gradient Boosting (XGBoost) on tabular data raise the needs to train XGBoost in the context of federated learning (FL). Existing works on federated XGBoost in the horizontal setting rely on the sharing of gradients, which induce per-node level communication frequency and serious privacy concerns. To alleviate these problems, we develop an innovative framework for horizontal federated XGBoost which does not depend on the sharing of gradients and simultaneously boosts privacy and communication efficiency by making the learning rates of the aggregated tree ensembles learnable. We conduct extensive evaluations on various classification and regression datasets, showing our approach achieves performance comparable to the state-of-the-art method and effectively improves communication efficiency by lowering both communication rounds and communication overhead by factors ranging from 25x to 700x. Project Page: https://flower.ai/blog/2023-04-19-xgboost-with-flower/
comment: Accepted at the 3rd ACM Workshop on Machine Learning and Systems (EuroMLSys), May 8th 2023, Rome, Italy
♻ ☆ How Does Stake Distribution Influence Consensus? Analyzing Blockchain Decentralization
In the PoS blockchain landscape, the challenge of achieving full decentralization is often hindered by a disproportionate concentration of staked tokens among a few validators. This study analyses this challenge by first formalizing decentralization metrics for weighted consensus mechanisms. An empirical analysis across ten permissionless blockchains uncovers significant weight concentration among validators, underscoring the need for an equitable approach. To counter this, we introduce the Square Root Stake Weight (SRSW) model, which effectively recalibrates staking weight distribution. Our examination of the SRSW model demonstrates notable improvements in the decentralization metrics: the Gini index improves by 37.16% on average, while Nakamoto coefficients for liveness and safety see mean enhancements of 101.04% and 80.09%, respectively. This research is a pivotal step toward a more fair and equitable distribution of staking weight, advancing the decentralization in blockchain consensus mechanisms.
comment: To appear in ICBC 2024
Machine Learning
☆ Modeling Analog Dynamic Range Compressors using Deep Learning and State-space Models
We describe a novel approach for developing realistic digital models of dynamic range compressors for digital audio production by analyzing their analog prototypes. While realistic digital dynamic compressors are potentially useful for many applications, the design process is challenging because the compressors operate nonlinearly over long time scales. Our approach is based on the structured state space sequence model (S4), as implementing the state-space model (SSM) has proven to be efficient at learning long-range dependencies and is promising for modeling dynamic range compressors. We present in this paper a deep learning model with S4 layers to model the Teletronix LA-2A analog dynamic range compressor. The model is causal, executes efficiently in real time, and achieves roughly the same quality as previous deep-learning models but with fewer parameters.
☆ Artificial Neural Microcircuits as Building Blocks: Concept and Challenges
Artificial Neural Networks (ANNs) are one of the most widely employed forms of bio-inspired computation. However the current trend is for ANNs to be structurally homogeneous. Furthermore, this structural homogeneity requires the application of complex training and learning tools that produce application specific ANNs, susceptible to pitfalls such as overfitting. In this paper, an new approach is explored, inspired by the role played in biology by Neural Microcircuits, the so called ``fundamental processing elements'' of organic nervous systems. How large neural networks, particularly Spiking Neural Networks (SNNs) can be assembled using Artificial Neural Microcircuits (ANMs), intended as off-the-shelf components, is articulated; the results of initial work to produce a catalogue of such Microcircuits though the use of Novelty Search is shown; followed by efforts to expand upon this initial work, including a discussion of challenges uncovered during these efforts and explorations of methods by which they might be overcome.
comment: 12 pages, 31 figures, 3 tables, submitted to A-Life Journal for review
☆ Optimization on a Finer Scale: Bounded Local Subgradient Variation Perspective
We initiate the study of nonsmooth optimization problems under bounded local subgradient variation, which postulates bounded difference between (sub)gradients in small local regions around points, in either average or maximum sense. The resulting class of objective functions encapsulates the classes of objective functions traditionally studied in optimization, which are defined based on either Lipschitz continuity of the objective or H\"{o}lder/Lipschitz continuity of its gradient. Further, the defined class contains functions that are neither Lipschitz continuous nor have a H\"{o}lder continuous gradient. When restricted to the traditional classes of optimization problems, the parameters defining the studied classes lead to more fine-grained complexity bounds, recovering traditional oracle complexity bounds in the worst case but generally leading to lower oracle complexity for functions that are not ``worst case.'' Some highlights of our results are that: (i) it is possible to obtain complexity results for both convex and nonconvex problems with the (local or global) Lipschitz constant being replaced by a constant of local subgradient variation and (ii) mean width of the subdifferential set around the optima plays a role in the complexity of nonsmooth optimization, particularly in parallel settings. A consequence of (ii) is that for any error parameter $\epsilon > 0$, parallel oracle complexity of nonsmooth Lipschitz convex optimization is lower than its sequential oracle complexity by a factor $\tilde{\Omega}\big(\frac{1}{\epsilon}\big)$ whenever the objective function is piecewise linear with polynomially many pieces in the input size. This is particularly surprising as existing parallel complexity lower bounds are based on such classes of functions. The seeming contradiction is resolved by considering the region in which the algorithm is allowed to query the objective.
☆ Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling
In the field of high-performance computing (HPC), there has been recent exploration into the use of deep reinforcement learning for cluster scheduling (DRL scheduling), which has demonstrated promising outcomes. However, a significant challenge arises from the lack of interpretability in deep neural networks (DNN), rendering them as black-box models to system managers. This lack of model interpretability hinders the practical deployment of DRL scheduling. In this work, we present a framework called IRL (Interpretable Reinforcement Learning) to address the issue of interpretability of DRL scheduling. The core idea is to interpret DNN (i.e., the DRL policy) as a decision tree by utilizing imitation learning. Unlike DNN, decision tree models are non-parametric and easily comprehensible to humans. To extract an effective and efficient decision tree, IRL incorporates the Dataset Aggregation (DAgger) algorithm and introduces the notion of critical state to prune the derived decision tree. Through trace-based experiments, we demonstrate that IRL is capable of converting a black-box DNN policy into an interpretable rulebased decision tree while maintaining comparable scheduling performance. Additionally, IRL can contribute to the setting of rewards in DRL scheduling.
☆ The Evolution of Football Betting- A Machine Learning Approach to Match Outcome Forecasting and Bookmaker Odds Estimation
This paper explores the significant history of professional football and the betting industry, tracing its evolution from clandestine beginnings to a lucrative multi-million-pound enterprise. Initiated by the legalization of gambling in 1960 and complemented by advancements in football data gathering pioneered by Thorold Charles Reep, the symbiotic relationship between these sectors has propelled rapid growth and innovation. Over the past six decades, both industries have undergone radical transformations, with data collection methods evolving from rudimentary notetaking to sophisticated technologies such as high-definition cameras and Artificial Intelligence (AI)-driven analytics. Therefore, the primary aim of this study is to utilize Machine Learning (ML) algorithms to forecast premier league football match outcomes. By analyzing historical data and investigating the significance of various features, the study seeks to identify the most effective predictive models and discern key factors influencing match results. Additionally, the study aims to utilize these forecasting to inform the establishment of bookmaker odds, providing insights into the impact of different variables on match outcomes. By highlighting the potential for informed decision-making in sports forecasting and betting, this study opens up new avenues for research and practical applications in the domain of sports analytics.
☆ Out-of-Distribution Detection via Deep Multi-Comprehension Ensemble
Recent research underscores the pivotal role of the Out-of-Distribution (OOD) feature representation field scale in determining the efficacy of models in OOD detection. Consequently, the adoption of model ensembles has emerged as a prominent strategy to augment this feature representation field, capitalizing on anticipated model diversity. However, our introduction of novel qualitative and quantitative model ensemble evaluation methods, specifically Loss Basin/Barrier Visualization and the Self-Coupling Index, reveals a critical drawback in existing ensemble methods. We find that these methods incorporate weights that are affine-transformable, exhibiting limited variability and thus failing to achieve the desired diversity in feature representation. To address this limitation, we elevate the dimensions of traditional model ensembles, incorporating various factors such as different weight initializations, data holdout, etc., into distinct supervision tasks. This innovative approach, termed Multi-Comprehension (MC) Ensemble, leverages diverse training tasks to generate distinct comprehensions of the data and labels, thereby extending the feature representation field. Our experimental results demonstrate the superior performance of the MC Ensemble strategy in OOD detection compared to both the naive Deep Ensemble method and a standalone model of comparable size. This underscores the effectiveness of our proposed approach in enhancing the model's capability to detect instances outside its training distribution.
☆ Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis CVPR2024
While replacing Gaussian decoders with a conditional diffusion model enhances the perceptual quality of reconstructions in neural image compression, their lack of inductive bias for image data restricts their ability to achieve state-of-the-art perceptual levels. To address this limitation, we adopt a non-isotropic diffusion model at the decoder side. This model imposes an inductive bias aimed at distinguishing between frequency contents, thereby facilitating the generation of high-quality images. Moreover, our framework is equipped with a novel entropy model that accurately models the probability distribution of latent representation by exploiting spatio-channel correlations in latent space, while accelerating the entropy decoding step. This channel-wise entropy model leverages both local and global spatial contexts within each channel chunk. The global spatial context is built upon the Transformer, which is specifically designed for image compression tasks. The designed Transformer employs a Laplacian-shaped positional encoding, the learnable parameters of which are adaptively adjusted for each channel cluster. Our experiments demonstrate that our proposed framework yields better perceptual quality compared to cutting-edge generative-based codecs, and the proposed entropy model contributes to notable bitrate savings.
comment: Accepted by CVPR2024
☆ Improving Sequence-to-Sequence Models for Abstractive Text Summarization Using Meta Heuristic Approaches
As human society transitions into the information age, reduction in our attention span is a contingency, and people who spend time reading lengthy news articles are decreasing rapidly and the need for succinct information is higher than ever before. Therefore, it is essential to provide a quick overview of important news by concisely summarizing the top news article and the most intuitive headline. When humans try to make summaries, they extract the essential information from the source and add useful phrases and grammatical annotations from the original extract. Humans have a unique ability to create abstractions. However, automatic summarization is a complicated problem to solve. The use of sequence-to-sequence (seq2seq) models for neural abstractive text summarization has been ascending as far as prevalence. Numerous innovative strategies have been proposed to develop the current seq2seq models further, permitting them to handle different issues like saliency, familiarity, and human lucidness and create excellent synopses. In this article, we aimed toward enhancing the present architectures and models for abstractive text summarization. The modifications have been aimed at fine-tuning hyper-parameters, attempting specific encoder-decoder combinations. We examined many experiments on an extensively used CNN/DailyMail dataset to check the effectiveness of various models.
☆ Partially Blinded Unlearning: Class Unlearning for Deep Networks a Bayesian Perspective
In order to adhere to regulatory standards governing individual data privacy and safety, machine learning models must systematically eliminate information derived from specific subsets of a user's training data that can no longer be utilized. The emerging discipline of Machine Unlearning has arisen as a pivotal area of research, facilitating the process of selectively discarding information designated to specific sets or classes of data from a pre-trained model, thereby eliminating the necessity for extensive retraining from scratch. The principal aim of this study is to formulate a methodology tailored for the purposeful elimination of information linked to a specific class of data from a pre-trained classification network. This intentional removal is crafted to degrade the model's performance specifically concerning the unlearned data class while concurrently minimizing any detrimental impacts on the model's performance in other classes. To achieve this goal, we frame the class unlearning problem from a Bayesian perspective, which yields a loss function that minimizes the log-likelihood associated with the unlearned data with a stability regularization in parameter space. This stability regularization incorporates Mohalanobis distance with respect to the Fisher Information matrix and $l_2$ distance from the pre-trained model parameters. Our novel approach, termed \textbf{Partially-Blinded Unlearning (PBU)}, surpasses existing state-of-the-art class unlearning methods, demonstrating superior effectiveness. Notably, PBU achieves this efficacy without requiring awareness of the entire training dataset but only to the unlearned data points, marking a distinctive feature of its performance.
☆ On the Equivalency, Substitutability, and Flexibility of Synthetic Data
We study, from an empirical standpoint, the efficacy of synthetic data in real-world scenarios. Leveraging synthetic data for training perception models has become a key strategy embraced by the community due to its efficiency, scalability, perfect annotations, and low costs. Despite proven advantages, few studies put their stress on how to efficiently generate synthetic datasets to solve real-world problems and to what extent synthetic data can reduce the effort for real-world data collection. To answer the questions, we systematically investigate several interesting properties of synthetic data -- the equivalency of synthetic data to real-world data, the substitutability of synthetic data for real data, and the flexibility of synthetic data generators to close up domain gaps. Leveraging the M3Act synthetic data generator, we conduct experiments on DanceTrack and MOT17. Our results suggest that synthetic data not only enhances model performance but also demonstrates substitutability for real data, with 60% to 80% replacement without performance loss. In addition, our study of the impact of synthetic data distributions on downstream performance reveals the importance of flexible data generators in narrowing domain gaps for improved model adaptability.
☆ An early warning indicator trained on stochastic disease-spreading models with different noises
The timely detection of disease outbreaks through reliable early warning signals (EWSs) is indispensable for effective public health mitigation strategies. Nevertheless, the intricate dynamics of real-world disease spread, often influenced by diverse sources of noise and limited data in the early stages of outbreaks, pose a significant challenge in developing reliable EWSs, as the performance of existing indicators varies with extrinsic and intrinsic noises. Here, we address the challenge of modeling disease when the measurements are corrupted by additive white noise, multiplicative environmental noise, and demographic noise into a standard epidemic mathematical model. To navigate the complexities introduced by these noise sources, we employ a deep learning algorithm that provides EWS in infectious disease outbreak by training on noise-induced disease-spreading models. The indicator's effectiveness is demonstrated through its application to real-world COVID-19 cases in Edmonton and simulated time series derived from diverse disease spread models affected by noise. Notably, the indicator captures an impending transition in a time series of disease outbreaks and outperforms existing indicators. This study contributes to advancing early warning capabilities by addressing the intricate dynamics inherent in real-world disease spread, presenting a promising avenue for enhancing public health preparedness and response efforts.
☆ CoverUp: Coverage-Guided LLM-Based Test Generation
This paper presents CoverUp, a novel system that drives the generation of high-coverage Python regression tests via a combination of coverage analysis and large-language models (LLMs). CoverUp iteratively improves coverage, interleaving coverage analysis with dialogs with the LLM to focus its attention on as yet uncovered lines and branches. The resulting test suites significantly improve coverage over the current state of the art: compared to CodaMosa, a hybrid LLM / search-based software testing system, CoverUp substantially improves coverage across the board. On a per-module basis, CoverUp achieves median line coverage of 81% (vs. 62%), branch coverage of 53% (vs. 35%) and line+branch coverage of 78% (vs. 55%). We show that CoverUp's iterative, coverage-guided approach is crucial to its effectiveness, contributing to nearly half of its successes.
comment: 11 pages
☆ Systematic construction of continuous-time neural networks for linear dynamical systems
Discovering a suitable neural network architecture for modeling complex dynamical systems poses a formidable challenge, often involving extensive trial and error and navigation through a high-dimensional hyper-parameter space. In this paper, we discuss a systematic approach to constructing neural architectures for modeling a subclass of dynamical systems, namely, Linear Time-Invariant (LTI) systems. We use a variant of continuous-time neural networks in which the output of each neuron evolves continuously as a solution of a first-order or second-order Ordinary Differential Equation (ODE). Instead of deriving the network architecture and parameters from data, we propose a gradient-free algorithm to compute sparse architecture and network parameters directly from the given LTI system, leveraging its properties. We bring forth a novel neural architecture paradigm featuring horizontal hidden layers and provide insights into why employing conventional neural architectures with vertical hidden layers may not be favorable. We also provide an upper bound on the numerical errors of our neural networks. Finally, we demonstrate the high accuracy of our constructed networks on three numerical examples.
comment: 37 pages, 25 figures
☆ Leveraging Deep Learning and Xception Architecture for High-Accuracy MRI Classification in Alzheimer Diagnosis
Exploring the application of deep learning technologies in the field of medical diagnostics, Magnetic Resonance Imaging (MRI) provides a unique perspective for observing and diagnosing complex neurodegenerative diseases such as Alzheimer Disease (AD). With advancements in deep learning, particularly in Convolutional Neural Networks (CNNs) and the Xception network architecture, we are now able to analyze and classify vast amounts of MRI data with unprecedented accuracy. The progress of this technology not only enhances our understanding of brain structural changes but also opens up new avenues for monitoring disease progression through non-invasive means and potentially allows for precise diagnosis in the early stages of the disease. This study aims to classify MRI images using deep learning models to identify different stages of Alzheimer Disease through a series of innovative data processing and model construction steps. Our experimental results show that the deep learning framework based on the Xception model achieved a 99.6% accuracy rate in the multi-class MRI image classification task, demonstrating its potential application value in assistive diagnosis. Future research will focus on expanding the dataset, improving model interpretability, and clinical validation to further promote the application of deep learning technology in the medical field, with the hope of bringing earlier diagnosis and more personalized treatment plans to Alzheimer Disease patients.
☆ Convergence analysis of OT-Flow for sample generation
Deep generative models aim to learn the underlying distribution of data and generate new ones. Despite the diversity of generative models and their high-quality generation performance in practice, most of them lack rigorous theoretical convergence proofs. In this work, we aim to establish some convergence results for OT-Flow, one of the deep generative models. First, by reformulating the framework of OT-Flow model, we establish the $\Gamma$-convergence of the formulation of OT-flow to the corresponding optimal transport (OT) problem as the regularization term parameter $\alpha$ goes to infinity. Second, since the loss function will be approximated by Monte Carlo method in training, we established the convergence between the discrete loss function and the continuous one when the sample number $N$ goes to infinity as well. Meanwhile, the approximation capability of the neural network provides an upper bound for the discrete loss function of the minimizers. The proofs in both aspects provide convincing assurances for OT-Flow.
☆ From Discrete to Continuous: Deep Fair Clustering With Transferable Representations
We consider the problem of deep fair clustering, which partitions data into clusters via the representations extracted by deep neural networks while hiding sensitive data attributes. To achieve fairness, existing methods present a variety of fairness-related objective functions based on the group fairness criterion. However, these works typically assume that the sensitive attributes are discrete and do not work for continuous sensitive variables, such as the proportion of the female population in an area. Besides, the potential of the representations learned from clustering tasks to improve performance on other tasks is ignored by existing works. In light of these limitations, we propose a flexible deep fair clustering method that can handle discrete and continuous sensitive attributes simultaneously. Specifically, we design an information bottleneck style objective function to learn fair and clustering-friendly representations. Furthermore, we explore for the first time the transferability of the extracted representations to other downstream tasks. Unlike existing works, we impose fairness at the representation level, which could guarantee fairness for the transferred task regardless of clustering results. To verify the effectiveness of the proposed method, we perform extensive experiments on datasets with discrete and continuous sensitive attributes, demonstrating the advantage of our method in comparison with state-of-the-art methods.
☆ Logic-based Explanations for Linear Support Vector Classifiers with Reject Option
Support Vector Classifier (SVC) is a well-known Machine Learning (ML) model for linear classification problems. It can be used in conjunction with a reject option strategy to reject instances that are hard to correctly classify and delegate them to a specialist. This further increases the confidence of the model. Given this, obtaining an explanation of the cause of rejection is important to not blindly trust the obtained results. While most of the related work has developed means to give such explanations for machine learning models, to the best of our knowledge none have done so for when reject option is present. We propose a logic-based approach with formal guarantees on the correctness and minimality of explanations for linear SVCs with reject option. We evaluate our approach by comparing it to Anchors, which is a heuristic algorithm for generating explanations. Obtained results show that our proposed method gives shorter explanations with reduced time cost.
comment: 16 pages, submitted to BRACIS 2023 (Brazilian Conference on Intelligent Systems), accepted version published in Intelligent Systems, LNCS, vol 14195
☆ Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals COLING 2024
Deep neural networks (DNNs) are notoriously vulnerable to adversarial attacks that place carefully crafted perturbations on normal examples to fool DNNs. To better understand such attacks, a characterization of the features carried by adversarial examples is needed. In this paper, we tackle this challenge by inspecting the subspaces of sample features through spectral analysis. We first empirically show that the features of either clean signals or adversarial perturbations are redundant and span in low-dimensional linear subspaces respectively with minimal overlap, and the classical low-dimensional subspace projection can suppress perturbation features out of the subspace of clean signals. This makes it possible for DNNs to learn a subspace where only features of clean signals exist while those of perturbations are discarded, which can facilitate the distinction of adversarial examples. To prevent the residual perturbations that is inevitable in subspace learning, we propose an independence criterion to disentangle clean signals from perturbations. Experimental results show that the proposed strategy enables the model to inherently suppress adversaries, which not only boosts model robustness but also motivates new directions of effective adversarial defense.
comment: Accepted by COLING 2024
☆ An Analytic Solution to Covariance Propagation in Neural Networks AISTATS 2024
Uncertainty quantification of neural networks is critical to measuring the reliability and robustness of deep learning systems. However, this often involves costly or inaccurate sampling methods and approximations. This paper presents a sample-free moment propagation technique that propagates mean vectors and covariance matrices across a network to accurately characterize the input-output distributions of neural networks. A key enabler of our technique is an analytic solution for the covariance of random variables passed through nonlinear activation functions, such as Heaviside, ReLU, and GELU. The wide applicability and merits of the proposed technique are shown in experiments analyzing the input-output distributions of trained neural networks and training Bayesian neural networks.
comment: Accepted to AISTATS 2024
☆ One Masked Model is All You Need for Sensor Fault Detection, Isolation and Accommodation IJCNN 2024
Accurate and reliable sensor measurements are critical for ensuring the safety and longevity of complex engineering systems such as wind turbines. In this paper, we propose a novel framework for sensor fault detection, isolation, and accommodation (FDIA) using masked models and self-supervised learning. Our proposed approach is a general time series modeling approach that can be applied to any neural network (NN) model capable of sequence modeling, and captures the complex spatio-temporal relationships among different sensors. During training, the proposed masked approach creates a random mask, which acts like a fault, for one or more sensors, making the training and inference task unified: finding the faulty sensors and correcting them. We validate our proposed technique on both a public dataset and a real-world dataset from GE offshore wind turbines, and demonstrate its effectiveness in detecting, diagnosing and correcting sensor faults. The masked model not only simplifies the overall FDIA pipeline, but also outperforms existing approaches. Our proposed technique has the potential to significantly improve the accuracy and reliability of sensor measurements in complex engineering systems in real-time, and could be applied to other types of sensors and engineering systems in the future. We believe that our proposed framework can contribute to the development of more efficient and effective FDIA techniques for a wide range of applications.
comment: Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)
☆ A Survey on Consumer IoT Traffic: Security and Privacy
For the past few years, the Consumer Internet of Things (CIoT) has entered public lives. While CIoT has improved the convenience of people's daily lives, it has also brought new security and privacy concerns. In this survey, we try to figure out what researchers can learn about the security and privacy of CIoT by traffic analysis, a popular method in the security community. From the security and privacy perspective, this survey seeks out the new characteristics in CIoT traffic analysis, the state-of-the-art progress in CIoT traffic analysis, and the challenges yet to be solved. We collected 310 papers from January 2018 to December 2023 related to CIoT traffic analysis from the security and privacy perspective and summarized the process of CIoT traffic analysis in which the new characteristics of CIoT are identified. Then, we detail existing works based on five application goals: device fingerprinting, user activity inference, malicious traffic analysis, security analysis, and measurement. At last, we discuss the new challenges and future research directions.
☆ Predicting Energy Budgets in Droplet Dynamics: A Recurrent Neural Network Approach
Neural networks in fluid mechanics offer an efficient approach for exploring complex flows, including multiphase and free surface flows. The recurrent neural network, particularly the Long Short-Term Memory (LSTM) model, proves attractive for learning mappings from transient inputs to dynamic outputs. This study applies LSTM to predict transient and static outputs for fluid flows under surface tension effects. Specifically, we explore two distinct droplet dynamic scenarios: droplets with diverse initial shapes impacting with solid surfaces, as well as the coalescence of two droplets following collision. Using only dimensionless numbers and geometric time series data from numerical simulations, LSTM predicts the energy budget. The marker-and-cell front-tracking methodology combined with a marker-and-cell finite-difference strategy is adopted for simulating the droplet dynamics. Using a recurrent neural network (RNN) architecture fed with time series data derived from geometrical parameters, as for example droplet diameter variation, our study shows the accuracy of our approach in predicting energy budgets, as for instance the kinetic, dissipation, and surface energy trends, across a range of Reynolds and Weber numbers in droplet dynamic problems. Finally, a two-phase sequential neural network using only geometric data, which is readily available in experimental settings, is employed to predict the energies and then use them to estimate static parameters, such as the Reynolds and Weber numbers. While our methodology has been primarily validated with simulation data, its adaptability to experimental datasets is a promising avenue for future exploration. We hope that our strategy can be useful for diverse applications, spanning from inkjet printing to combustion engines, where the prediction of energy budgets or dissipation energies is crucial.
☆ CFAT: Unleashing TriangularWindows for Image Super-resolution CVPR 2024
Transformer-based models have revolutionized the field of image super-resolution (SR) by harnessing their inherent ability to capture complex contextual features. The overlapping rectangular shifted window technique used in transformer architecture nowadays is a common practice in super-resolution models to improve the quality and robustness of image upscaling. However, it suffers from distortion at the boundaries and has limited unique shifting modes. To overcome these weaknesses, we propose a non-overlapping triangular window technique that synchronously works with the rectangular one to mitigate boundary-level distortion and allows the model to access more unique sifting modes. In this paper, we propose a Composite Fusion Attention Transformer (CFAT) that incorporates triangular-rectangular window-based local attention with a channel-based global attention technique in image super-resolution. As a result, CFAT enables attention mechanisms to be activated on more image pixels and captures long-range, multi-scale features to improve SR performance. The extensive experimental results and ablation study demonstrate the effectiveness of CFAT in the SR domain. Our proposed model shows a significant 0.7 dB performance improvement over other state-of-the-art SR architectures.
comment: Accepted to CVPR 2024
☆ A Survey on Self-Supervised Pre-Training of Graph Foundation Models: A Knowledge-Based Perspective
Graph self-supervised learning is now a go-to method for pre-training graph foundation models, including graph neural networks, graph transformers, and more recent large language model (LLM)-based graph models. There is a wide variety of knowledge patterns embedded in the structure and properties of graphs which may be used for pre-training, but we lack a systematic overview of self-supervised pre-training tasks from the perspective of graph knowledge. In this paper, we comprehensively survey and analyze the pre-training tasks of graph foundation models from a knowledge-based perspective, consisting of microscopic (nodes, links, etc) and macroscopic knowledge (clusters, global structure, etc). It covers a total of 9 knowledge categories and 25 pre-training tasks, as well as various downstream task adaptation strategies. Furthermore, an extensive list of the related papers with detailed metadata is provided at https://github.com/Newiz430/Pretext.
comment: Work in progress
☆ Complementary Recommendation in E-commerce: Definition, Approaches, and Future Directions
In recent years, complementary recommendation has received extensive attention in the e-commerce domain. In this paper, we comprehensively summarize and compare 34 representative studies conducted between 2009 and 2024. Firstly, we compare the data and methods used for modeling complementary relationships between products, including simple complementarity and more complex scenarios such as asymmetric complementarity, the coexistence of substitution and complementarity relationships between products, and varying degrees of complementarity between different pairs of products. Next, we classify and compare the models based on the research problems of complementary recommendation, such as diversity, personalization, and cold-start. Furthermore, we provide a comparative analysis of experimental results from different studies conducted on the same dataset, which helps identify the strengths and weaknesses of the research. Compared to previous surveys, this paper provides a more updated and comprehensive summary of the research, discusses future research directions, and contributes to the advancement of this field.
comment: 20 pages,9 figures
☆ SSHPool: The Separated Subgraph-based Hierarchical Pooling
In this paper, we develop a novel local graph pooling method, namely the Separated Subgraph-based Hierarchical Pooling (SSHPool), for graph classification. To this end, we commence by assigning the nodes of a sample graph into different clusters, resulting in a family of separated subgraphs. We individually employ a local graph convolution units as the local structure to further compress each subgraph into a coarsened node, transforming the original graph into a coarsened graph. Since these subgraphs are separated by different clusters and the structural information cannot be propagated between them, the local convolution operation can significantly avoid the over-smoothing problem arising in most existing Graph Neural Networks (GNNs). By hierarchically performing the proposed procedures on the resulting coarsened graph, the proposed SSHPool can effectively extract the hierarchical global feature of the original graph structure, encapsulating rich intrinsic structural characteristics. Furthermore, we develop an end-to-end GNN framework associated with the proposed SSHPool module for graph classification. Experimental results demonstrate the superior performance of the proposed model on real-world datasets, significantly outperforming state-of-the-art GNN methods in terms of the classification accuracies.
☆ Runtime Monitoring and Fault Detection for Neural Network-Controlled Systems
There is an emerging trend in applying deep learning methods to control complex nonlinear systems. This paper considers enhancing the runtime safety of nonlinear systems controlled by neural networks in the presence of disturbance and measurement noise. A robustly stable interval observer is designed to generate sound and precise lower and upper bounds for the neural network, nonlinear function, and system state. The obtained interval is utilised to monitor the real-time system safety and detect faults in the system outputs or actuators. An adaptive cruise control vehicular system is simulated to demonstrate effectiveness of the proposed design.
comment: Accepted to SAFEPROCESS 2024
☆ AKBR: Learning Adaptive Kernel-based Representations for Graph Classification
In this paper, we propose a new model to learn Adaptive Kernel-based Representations (AKBR) for graph classification. Unlike state-of-the-art R-convolution graph kernels that are defined by merely counting any pair of isomorphic substructures between graphs and cannot provide an end-to-end learning mechanism for the classifier, the proposed AKBR approach aims to define an end-to-end representation learning model to construct an adaptive kernel matrix for graphs. To this end, we commence by leveraging a novel feature-channel attention mechanism to capture the interdependencies between different substructure invariants of original graphs. The proposed AKBR model can thus effectively identify the structural importance of different substructures, and compute the R-convolution kernel between pairwise graphs associated with the more significant substructures specified by their structural attentions. Since each row of the resulting kernel matrix can be theoretically seen as the embedding vector of a sample graph, the proposed AKBR model is able to directly employ the resulting kernel matrix as the graph feature matrix and input it into the classifier for classification (i.e., the SoftMax layer), naturally providing an end-to-end learning architecture between the kernel computation as well as the classifier. Experimental results show that the proposed AKBR model outperforms existing state-of-the-art graph kernels and deep learning methods on standard graph benchmarks.
☆ A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters
Joint consideration of scheduling and adaptive parallelism offers great opportunities for improving the training efficiency of large models on heterogeneous GPU clusters. However, integrating adaptive parallelism into a cluster scheduler expands the cluster scheduling space. The new space is the product of the original scheduling space and the parallelism exploration space of adaptive parallelism (also a product of pipeline, data, and tensor parallelism). The exponentially enlarged scheduling space and ever-changing optimal parallelism plan from adaptive parallelism together result in the contradiction between low-overhead and accurate performance data acquisition for efficient cluster scheduling. This paper presents Crius, a training system for efficiently scheduling multiple large models with adaptive parallelism in a heterogeneous cluster. Crius proposes a novel scheduling granularity called Cell. It represents a job with deterministic resources and pipeline stages. The exploration space of Cell is shrunk to the product of only data and tensor parallelism, thus exposing the potential for accurate and low-overhead performance estimation. Crius then accurately estimates Cells and efficiently schedules training jobs. When a Cell is selected as a scheduling choice, its represented job runs with the optimal parallelism plan explored. Experimental results show that Crius reduces job completion time by up to 48.9% and schedules large models with up to 1.49x cluster throughput improvement.
Self-Supervised Multi-Frame Neural Scene Flow
Neural Scene Flow Prior (NSFP) and Fast Neural Scene Flow (FNSF) have shown remarkable adaptability in the context of large out-of-distribution autonomous driving. Despite their success, the underlying reasons for their astonishing generalization capabilities remain unclear. Our research addresses this gap by examining the generalization capabilities of NSFP through the lens of uniform stability, revealing that its performance is inversely proportional to the number of input point clouds. This finding sheds light on NSFP's effectiveness in handling large-scale point cloud scene flow estimation tasks. Motivated by such theoretical insights, we further explore the improvement of scene flow estimation by leveraging historical point clouds across multiple frames, which inherently increases the number of point clouds. Consequently, we propose a simple and effective method for multi-frame point cloud scene flow estimation, along with a theoretical evaluation of its generalization abilities. Our analysis confirms that the proposed method maintains a limited generalization error, suggesting that adding multiple frames to the scene flow optimization process does not detract from its generalizability. Extensive experimental results on large-scale autonomous driving Waymo Open and Argoverse lidar datasets demonstrate that the proposed method achieves state-of-the-art performance.
☆ Opportunities and challenges in the application of large artificial intelligence models in radiology
Influenced by ChatGPT, artificial intelligence (AI) large models have witnessed a global upsurge in large model research and development. As people enjoy the convenience by this AI large model, more and more large models in subdivided fields are gradually being proposed, especially large models in radiology imaging field. This article first introduces the development history of large models, technical details, workflow, working principles of multimodal large models and working principles of video generation large models. Secondly, we summarize the latest research progress of AI large models in radiology education, radiology report generation, applications of unimodal and multimodal radiology. Finally, this paper also summarizes some of the challenges of large AI models in radiology, with the aim of better promoting the rapid revolution in the field of radiography.
☆ A Transformer approach for Electricity Price Forecasting
This paper presents a novel approach to electricity price forecasting (EPF) using a pure Transformer model. As opposed to other alternatives, no other recurrent network is used in combination to the attention mechanism. Hence, showing that the attention layer is enough for capturing the temporal patterns. The paper also provides fair comparison of the models using the open-source EPF toolbox and provide the code to enhance reproducibility and transparency in EPF research. The results show that the Transformer model outperforms traditional methods, offering a promising solution for reliable and sustainable power system operation.
comment: 7 pages
☆ A Multi-Label Dataset of French Fake News: Human and Machine Insights LREC
We present a corpus of 100 documents, OBSINFOX, selected from 17 sources of French press considered unreliable by expert agencies, annotated using 11 labels by 8 annotators. By collecting more labels than usual, by more annotators than is typically done, we can identify features that humans consider as characteristic of fake news, and compare them to the predictions of automated classifiers. We present a topic and genre analysis using Gate Cloud, indicative of the prevalence of satire-like text in the corpus. We then use the subjectivity analyzer VAGO, and a neural version of it, to clarify the link between ascriptions of the label Subjective and ascriptions of the label Fake News. The annotated dataset is available online at the following url: https://github.com/obs-info/obsinfox Keywords: Fake News, Multi-Labels, Subjectivity, Vagueness, Detail, Opinion, Exaggeration, French Press
comment: Paper to appear in the Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
☆ LLMs as Compiler for Arabic Programming Language
In this paper we introduce APL (Arabic Programming Language) that uses Large language models (LLM) as semi-compiler to covert Arabic text code to python code then run the code. Designing a full pipeline from the structure of the APL text then a prompt (using prompt engineering) then running the prodcued python code using PyRunner. This project has a three parts first python library, a playground with simple interface and this research paper.
☆ IBCB: Efficient Inverse Batched Contextual Bandit for Behavioral Evolution History
Traditional imitation learning focuses on modeling the behavioral mechanisms of experts, which requires a large amount of interaction history generated by some fixed expert. However, in many streaming applications, such as streaming recommender systems, online decision-makers typically engage in online learning during the decision-making process, meaning that the interaction history generated by online decision-makers includes their behavioral evolution from novice expert to experienced expert. This poses a new challenge for existing imitation learning approaches that can only utilize data from experienced experts. To address this issue, this paper proposes an inverse batched contextual bandit (IBCB) framework that can efficiently perform estimations of environment reward parameters and learned policy based on the expert's behavioral evolution history. Specifically, IBCB formulates the inverse problem into a simple quadratic programming problem by utilizing the behavioral evolution history of the batched contextual bandit with inaccessible rewards. We demonstrate that IBCB is a unified framework for both deterministic and randomized bandit policies. The experimental results indicate that IBCB outperforms several existing imitation learning algorithms on synthetic and real-world data and significantly reduces running time. Additionally, empirical analyses reveal that IBCB exhibits better out-of-distribution generalization and is highly effective in learning the bandit policy from the interaction history of novice experts.
☆ Manifold Regularization Classification Model Based On Improved Diffusion Map
Manifold regularization model is a semi-supervised learning model that leverages the geometric structure of a dataset, comprising a small number of labeled samples and a large number of unlabeled samples, to generate classifiers. However, the original manifold norm limits the performance of models to local regions. To address this limitation, this paper proposes an approach to improve manifold regularization based on a label propagation model. We initially enhance the probability transition matrix of the diffusion map algorithm, which can be used to estimate the Neumann heat kernel, enabling it to accurately depict the label propagation process on the manifold. Using this matrix, we establish a label propagation function on the dataset to describe the distribution of labels at different time steps. Subsequently, we extend the label propagation function to the entire data manifold. We prove that the extended label propagation function converges to a stable distribution after a sufficiently long time and can be considered as a classifier. Building upon this concept, we propose a viable improvement to the manifold regularization model and validate its superiority through experiments.
comment: 20 pages, 24figures
♻ ☆ Ghost on the Shell: An Expressive Representation of General 3D Shapes ICLR 2024
The creation of photorealistic virtual worlds requires the accurate modeling of 3D surface geometry for a wide range of objects. For this, meshes are appealing since they 1) enable fast physics-based rendering with realistic material and lighting, 2) support physical simulation, and 3) are memory-efficient for modern graphics pipelines. Recent work on reconstructing and statistically modeling 3D shape, however, has critiqued meshes as being topologically inflexible. To capture a wide range of object shapes, any 3D representation must be able to model solid, watertight, shapes as well as thin, open, surfaces. Recent work has focused on the former, and methods for reconstructing open surfaces do not support fast reconstruction with material and lighting or unconditional generative modelling. Inspired by the observation that open surfaces can be seen as islands floating on watertight surfaces, we parameterize open surfaces by defining a manifold signed distance field on watertight templates. With this parameterization, we further develop a grid-based and differentiable representation that parameterizes both watertight and non-watertight meshes of arbitrary topology. Our new representation, called Ghost-on-the-Shell (G-Shell), enables two important applications: differentiable rasterization-based reconstruction from multiview images and generative modelling of non-watertight meshes. We empirically demonstrate that G-Shell achieves state-of-the-art performance on non-watertight mesh reconstruction and generation tasks, while also performing effectively for watertight meshes.
comment: ICLR 2024 Oral (v3: 30 pages, 19 figures, Project Page: https://gshell3d.github.io/)
♻ ☆ Development and Evaluation of a Learning-based Model for Real-time Haptic Texture Rendering
Current Virtual Reality (VR) environments lack the rich haptic signals that humans experience during real-life interactions, such as the sensation of texture during lateral movement on a surface. Adding realistic haptic textures to VR environments requires a model that generalizes to variations of a user's interaction and to the wide variety of existing textures in the world. Current methodologies for haptic texture rendering exist, but they usually develop one model per texture, resulting in low scalability. We present a deep learning-based action-conditional model for haptic texture rendering and evaluate its perceptual performance in rendering realistic texture vibrations through a multi part human user study. This model is unified over all materials and uses data from a vision-based tactile sensor (GelSight) to render the appropriate surface conditioned on the user's action in real time. For rendering texture, we use a high-bandwidth vibrotactile transducer attached to a 3D Systems Touch device. The result of our user study shows that our learning-based method creates high-frequency texture renderings with comparable or better quality than state-of-the-art methods without the need for learning a separate model per texture. Furthermore, we show that the method is capable of rendering previously unseen textures using a single GelSight image of their surface.
comment: Accepted for publication in IEEE Transactions on Haptics 2024. 12 pages, 8 figures
♻ ☆ VQPy: An Object-Oriented Approach to Modern Video Analytics
Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., human, animals, cars, etc.), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend$\unicode{x2015}$a Python variant with constructs that make it easy for users to express video objects and their interactions$\unicode{x2015}$as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
comment: MLSys'24
♻ ☆ Assessing cognitive function among older adults using machine learning and wearable device data: a feasibility study
Timely implementation of interventions to slow cognitive decline among older adults requires accurate monitoring to detect changes in cognitive function. Data gathered using wearable devices that can continuously monitor factors known to be associated with cognition could be used to train machine learning models and develop wearable-based cognitive monitoring systems. Using data from over 2,400 older adults in the National Health and Nutrition Examination Survey (NHANES) we developed prediction models to differentiate older adults with normal cognition from those with poor cognition based on outcomes from three cognitive tests measuring different domains of cognitive function. During repeated cross-validation, CatBoost, XGBoost, and Random Forest models performed best when predicting cognition based on processing speed, working memory, and attention (median AUCs >0.82) compared to immediate and delayed recall (median AUCs >0.72) and categorical verbal fluency (median AUC >0.68). Activity and sleep parameters were also more strongly associated with processing speed, working memory, and attention compared to other cognitive subdomains. Our work provides proof of concept that wearable-based cognitive monitoring systems may be a viable alternative to traditional methods for monitoring processing speeds, working memory, and attention. We further identified novel metrics that could be targets in future causal studies seeking to better understand how sleep and activity parameters influence cognitive function among older adults.
♻ ☆ Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models ICLR 2024
Language models have shown promise in various tasks but can be affected by undesired data during training, fine-tuning, or alignment. For example, if some unsafe conversations are wrongly annotated as safe ones, the model fine-tuned on these samples may be harmful. Therefore, the correctness of annotations, i.e., the credibility of the dataset, is important. This study focuses on the credibility of real-world datasets, including the popular benchmarks Jigsaw Civil Comments, Anthropic Harmless & Red Team, PKU BeaverTails & SafeRLHF, that can be used for training a harmless language model. Given the cost and difficulty of cleaning these datasets by humans, we introduce a systematic framework for evaluating the credibility of datasets, identifying label errors, and evaluating the influence of noisy labels in the curated language data, specifically focusing on unsafe comments and conversation classification. With the framework, we find and fix an average of 6.16% label errors in 11 datasets constructed from the above benchmarks. The data credibility and downstream learning performance can be remarkably improved by directly fixing label errors, indicating the significance of cleaning existing real-world datasets. We provide an open-source tool, Docta, for data cleaning at https://github.com/Docta-ai/docta.
comment: ICLR 2024
♻ ☆ Principled Federated Domain Adaptation: Gradient Projection and Auto-Weighting ICLR 2024
Federated Domain Adaptation (FDA) describes the federated learning (FL) setting where source clients and a server work collaboratively to improve the performance of a target client where limited data is available. The domain shift between the source and target domains, coupled with limited data of the target client, makes FDA a challenging problem, e.g., common techniques such as federated averaging and fine-tuning fail due to domain shift and data scarcity. To theoretically understand the problem, we introduce new metrics that characterize the FDA setting and a theoretical framework with novel theorems for analyzing the performance of server aggregation rules. Further, we propose a novel lightweight aggregation rule, Federated Gradient Projection ($\texttt{FedGP}$), which significantly improves the target performance with domain shift and data scarcity. Moreover, our theory suggests an $\textit{auto-weighting scheme}$ that finds the optimal combinations of the source and target gradients. This scheme improves both $\texttt{FedGP}$ and a simpler heuristic aggregation rule. Extensive experiments verify the theoretical insights and illustrate the effectiveness of the proposed methods in practice.
comment: ICLR 2024
♻ ☆ DiffusionNAG: Predictor-guided Neural Architecture Generation with Diffusion Models ICLR 2024
Existing NAS methods suffer from either an excessive amount of time for repetitive sampling and training of many task-irrelevant architectures. To tackle such limitations of existing NAS methods, we propose a paradigm shift from NAS to a novel conditional Neural Architecture Generation (NAG) framework based on diffusion models, dubbed DiffusionNAG. Specifically, we consider the neural architectures as directed graphs and propose a graph diffusion model for generating them. Moreover, with the guidance of parameterized predictors, DiffusionNAG can flexibly generate task-optimal architectures with the desired properties for diverse tasks, by sampling from a region that is more likely to satisfy the properties. This conditional NAG scheme is significantly more efficient than previous NAS schemes which sample the architectures and filter them using the property predictors. We validate the effectiveness of DiffusionNAG through extensive experiments in two predictor-based NAS scenarios: Transferable NAS and Bayesian Optimization (BO)-based NAS. DiffusionNAG achieves superior performance with speedups of up to 35 times when compared to the baselines on Transferable NAS benchmarks. Furthermore, when integrated into a BO-based algorithm, DiffusionNAG outperforms existing BO-based NAS approaches, particularly in the large MobileNetV3 search space on the ImageNet 1K dataset. Code is available at https://github.com/CownowAn/DiffusionNAG.
comment: Accepted to ICLR 2024
♻ ☆ Latent Dataset Distillation with Diffusion Models
The efficacy of machine learning has traditionally relied on the availability of increasingly larger datasets. However, large datasets pose storage challenges and contain non-influential samples, which could be ignored during training without impacting the final accuracy of the model. In response to these limitations, the concept of distilling the information on a dataset into a condensed set of (synthetic) samples, namely a distilled dataset, emerged. One crucial aspect is the selected architecture (usually ConvNet) for linking the original and synthetic datasets. However, the final accuracy is lower if the employed model architecture differs from the model used during distillation. Another challenge is the generation of high-resolution images, e.g., 128x128 and higher. In this paper, we propose Latent Dataset Distillation with Diffusion Models (LD3M) that combine diffusion in latent space with dataset distillation to tackle both challenges. LD3M incorporates a novel diffusion process tailored for dataset distillation, which improves the gradient norms for learning synthetic images. By adjusting the number of diffusion steps, LD3M also offers a straightforward way of controlling the trade-off between speed and accuracy. We evaluate our approach in several ImageNet subsets and for high-resolution images (128x128 and 256x256). As a result, LD3M consistently outperforms state-of-the-art distillation techniques by up to 4.8 p.p. and 4.2 p.p. for 1 and 10 images per class, respectively.
♻ ☆ Stochastic Hessian Fittings on Lie Groups
This paper studies the fitting of Hessian or its inverse for stochastic optimizations using a Hessian fitting criterion from the preconditioned stochastic gradient descent (PSGD) method, which is intimately related to many commonly used second order and adaptive gradient optimizers, e.g., BFGS, Gaussian-Newton and natural gradient descent, AdaGrad, etc. Our analyses reveal the efficiency and reliability differences among a wide range of preconditioner fitting methods, from closed-form to iterative solutions, using Hessian-vector products or stochastic gradients only, with Hessian fittings in the Euclidean space, the manifold of symmetric positive definite (SPL) matrices, or a variety of Lie groups. The most intriguing discovery is that the Hessian fitting itself as an optimization problem is strongly convex under mild conditions on a specific yet general enough Lie group. This discovery turns Hessian fitting into a well behaved optimization problem, and facilitates the designs of highly efficient and elegant Lie group sparse preconditioner fitting methods for large scale stochastic optimizations.
♻ ☆ FedNMUT -- Federated Noisy Model Update Tracking Convergence Analysis
A novel Decentralized Noisy Model Update Tracking Federated Learning algorithm (FedNMUT) is proposed that is tailored to function efficiently in the presence of noisy communication channels that reflect imperfect information exchange. This algorithm uses gradient tracking to minimize the impact of data heterogeneity while minimizing communication overhead. The proposed algorithm incorporates noise into its parameters to mimic the conditions of noisy communication channels, thereby enabling consensus among clients through a communication graph topology in such challenging environments. FedNMUT prioritizes parameter sharing and noise incorporation to increase the resilience of decentralized learning systems against noisy communications. Theoretical results for the smooth non-convex objective function are provided by us, and it is shown that the $\epsilon-$stationary solution is achieved by our algorithm at the rate of $\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)$, where $T$ is the total number of communication rounds. Additionally, via empirical validation, we demonstrated that the performance of FedNMUT is superior to the existing state-of-the-art methods and conventional parameter-mixing approaches in dealing with imperfect information sharing. This proves the capability of the proposed algorithm to counteract the negative effects of communication noise in a decentralized learning framework.
comment: arXiv admin note: text overlap with arXiv:2303.10695
♻ ☆ Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models ICLR 2024
Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to the large memory bottleneck, specifically in small batch inference settings (e.g. mobile devices). Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers. To mitigate the undesirable outlier effect, we first propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel (IC) rather than the conventional per-output-channel (per-OC). Our method is motivated by the observation that activation outliers affect the input dimension of the weight matrix, so similarly grouping the weights in the IC direction can isolate outliers within a group. We also find that activation outliers do not dictate quantization difficulty, and inherent weight sensitivities also exist. With per-IC quantization as a new outlier-friendly scheme, we propose Adaptive Dimensions (AdaDim), a versatile quantization framework that can adapt to various weight sensitivity patterns. We demonstrate the effectiveness of AdaDim by augmenting prior methods such as Round-To-Nearest and GPTQ, showing significant improvements across various language modeling benchmarks for both base (up to +4.7% on MMLU) and instruction-tuned (up to +10% on HumanEval) LLMs. Code is available at https://github.com/johnheo/adadim-llm
comment: ICLR 2024. 19 pages, 11 figures, 10 tables
♻ ☆ Detection of diabetic retinopathy using longitudinal self-supervised learning MICCAI
Longitudinal imaging is able to capture both static anatomical structures and dynamic changes in disease progression towards earlier and better patient-specific pathology management. However, conventional approaches for detecting diabetic retinopathy (DR) rarely take advantage of longitudinal information to improve DR analysis. In this work, we investigate the benefit of exploiting self-supervised learning with a longitudinal nature for DR diagnosis purposes. We compare different longitudinal self-supervised learning (LSSL) methods to model the disease progression from longitudinal retinal color fundus photographs (CFP) to detect early DR severity changes using a pair of consecutive exams. The experiments were conducted on a longitudinal DR screening dataset with or without those trained encoders (LSSL) acting as a longitudinal pretext task. Results achieve an AUC of 0.875 for the baseline (model trained from scratch) and an AUC of 0.96 (95% CI: 0.9593-0.9655 DeLong test) with a p-value < 2.2e-16 on early fusion using a simple ResNet alike architecture with frozen LSSL weights, suggesting that the LSSL latent space enables to encode the dynamic of DR progression.
comment: Accepted preprint for presentation at MICCAI-OMIA
♻ ☆ Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning
Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet, deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. However, this reliance on predefined safety constraints poses limitations in dynamic and unpredictable real-world settings where such constraints may not be available or sufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task, intricately integrating constrained policy optimization, using a Lagrangian-variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization for optimizing parameters for the given pSTL safety specification. Through experimentation in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate successful learning of STL safety constraint parameters, exhibiting a high degree of conformity with true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario that possesses complete prior knowledge of safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to those constraints.
♻ ☆ Uncertainty Quantification of MLE for Entity Ranking with Covariates
This paper concerns with statistical estimation and inference for the ranking problems based on pairwise comparisons with additional covariate information such as the attributes of the compared items. Despite extensive studies, few prior literatures investigate this problem under the more realistic setting where covariate information exists. To tackle this issue, we propose a novel model, Covariate-Assisted Ranking Estimation (CARE) model, that extends the well-known Bradley-Terry-Luce (BTL) model, by incorporating the covariate information. Specifically, instead of assuming every compared item has a fixed latent score $\{\theta_i^*\}_{i=1}^n$, we assume the underlying scores are given by $\{\alpha_i^*+{x}_i^\top\beta^*\}_{i=1}^n$, where $\alpha_i^*$ and ${x}_i^\top\beta^*$ represent latent baseline and covariate score of the $i$-th item, respectively. We impose natural identifiability conditions and derive the $\ell_{\infty}$- and $\ell_2$-optimal rates for the maximum likelihood estimator of $\{\alpha_i^*\}_{i=1}^{n}$ and $\beta^*$ under a sparse comparison graph, using a novel `leave-one-out' technique (Chen et al., 2019) . To conduct statistical inferences, we further derive asymptotic distributions for the MLE of $\{\alpha_i^*\}_{i=1}^n$ and $\beta^*$ with minimal sample complexity. This allows us to answer the question whether some covariates have any explanation power for latent scores and to threshold some sparse parameters to improve the ranking performance. We improve the approximation method used in (Gao et al., 2021) for the BLT model and generalize it to the CARE model. Moreover, we validate our theoretical results through large-scale numerical studies and an application to the mutual fund stock holding dataset.
comment: 81 pages, 3 figures
♻ ☆ Lemur: Integrating Large Language Models in Automated Program Verification ICLR'24
The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of derivation rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure, which led to practical improvements on a set of synthetic and competition benchmarks.
comment: Accepted at ICLR'24
♻ ☆ Deep Limit Order Book Forecasting
We exploit cutting-edge deep learning methodologies to explore the predictability of high-frequency Limit Order Book mid-price changes for a heterogeneous set of stocks traded on the NASDAQ exchange. In so doing, we release `LOBFrame', an open-source code base to efficiently process large-scale Limit Order Book data and quantitatively assess state-of-the-art deep learning models' forecasting capabilities. Our results are twofold. We demonstrate that the stocks' microstructural characteristics influence the efficacy of deep learning methods and that their high forecasting power does not necessarily correspond to actionable trading signals. We argue that traditional machine learning metrics fail to adequately assess the quality of forecasts in the Limit Order Book context. As an alternative, we propose an innovative operational framework that evaluates predictions' practicality by focusing on the probability of accurately forecasting complete transactions. This work offers academics and practitioners an avenue to make informed and robust decisions on the application of deep learning techniques, their scope and limitations, effectively exploiting emergent statistical properties of the Limit Order Book.
comment: 43 pages, 14 figures, 12 Tables
♻ ☆ C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion ICLR 2024
In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration, which is a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT.
comment: ICLR 2024
♻ ☆ Sparse joint shift in multinomial classification
Sparse joint shift (SJS) was recently proposed as a tractable model for general dataset shift which may cause changes to the marginal distributions of features and labels as well as the posterior probabilities and the class-conditional feature distributions. Fitting SJS for a target dataset without label observations may produce valid predictions of labels and estimates of class prior probabilities. We present new results on the transmission of SJS from sets of features to larger sets of features, a conditional correction formula for the class posterior probabilities under the target distribution, identifiability of SJS, and the relationship between SJS and covariate shift. In addition, we point out inconsistencies in the algorithms which were proposed for estimating the characteristics of SJS, as they could hamper the search for optimal solutions, and suggest potential improvements.
comment: 27 pages
♻ ☆ FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training VLDB 2024
A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto a GPU. Due to limited GPU memory, expensive data movement is necessary to facilitate the storage of these features on alternative devices with slower access (e.g. CPU memory). Moreover, the irregularity of graph structures contributes to poor data locality which further exacerbates the problem. Consequently, existing frameworks capable of efficiently training large GNN models usually incur a significant accuracy degradation because of the currently-available shortcuts involved. To address these limitations, we instead propose FreshGNN, a general-purpose GNN mini-batch training framework that leverages a historical cache for storing and reusing GNN node embeddings instead of re-computing them through fetching raw features at every iteration. Critical to its success, the corresponding cache policy is designed, using a combination of gradient-based and staleness criteria, to selectively screen those embeddings which are relatively stable and can be cached, from those that need to be re-computed to reduce estimation errors and subsequent downstream accuracy loss. When paired with complementary system enhancements to support this selective historical cache, FreshGNN is able to accelerate the training speed on large graph datasets such as ogbn-papers100M and MAG240M by 3.4x up to 20.5x and reduce the memory access by 59%, with less than 1% influence on test accuracy.
comment: Accepted by VLDB 2024
♻ ☆ EnduRL: Enhancing Safety, Stability, and Efficiency of Mixed Traffic Under Real-World Perturbations Via Reinforcement Learning
Human-driven vehicles (HVs) amplify naturally occurring perturbations in traffic, leading to congestion--a major contributor to increased fuel consumption, higher collision risks, and reduced road capacity utilization. While previous research demonstrates that Robot Vehicles (RVs) can be leveraged to mitigate these issues, most such studies rely on simulations with simplistic models of human car-following behaviors. In this work, we analyze real-world driving trajectories and extract a wide range of acceleration profiles. We then incorporates these profiles into simulations for training RVs to mitigate congestion. We evaluate the safety, efficiency, and stability of mixed traffic via comprehensive experiments conducted in two mixed traffic environments (Ring and Bottleneck) at various traffic densities, configurations, and RV penetration rates. The results show that under real-world perturbations, prior RV controllers experience performance degradation on all three objectives (sometimes even lower than 100% HVs). To address this, we introduce a reinforcement learning based RV that employs a congestion stage classifier to optimize the safety, efficiency, and stability of mixed traffic. Our RVs demonstrate significant improvements: safety by up to 66%, efficiency by up to 54%, and stability by up to 97%.
♻ ☆ Rethinking Channel Dependence for Multivariate Time Series Forecasting: Learning from Leading Indicators ICLR 2024
Recently, channel-independent methods have achieved state-of-the-art performance in multivariate time series (MTS) forecasting. Despite reducing overfitting risks, these methods miss potential opportunities in utilizing channel dependence for accurate predictions. We argue that there exist locally stationary lead-lag relationships between variates, i.e., some lagged variates may follow the leading indicators within a short time period. Exploiting such channel dependence is beneficial since leading indicators offer advance information that can be used to reduce the forecasting difficulty of the lagged variates. In this paper, we propose a new method named LIFT that first efficiently estimates leading indicators and their leading steps at each time step and then judiciously allows the lagged variates to utilize the advance information from leading indicators. LIFT plays as a plugin that can be seamlessly collaborated with arbitrary time series forecasting methods. Extensive experiments on six real-world datasets demonstrate that LIFT improves the state-of-the-art methods by 5.5% in average forecasting performance. Our code is available at https://github.com/SJTU-Quant/LIFT.
comment: Accepted to ICLR 2024
♻ ☆ Can Copyright be Reduced to Privacy?
There is a growing concern that generative AI models will generate outputs closely resembling the copyrighted materials for which they are trained. This worry has intensified as the quality and complexity of generative models have immensely improved, and the availability of extensive datasets containing copyrighted material has expanded. Researchers are actively exploring strategies to mitigate the risk of generating infringing samples, with a recent line of work suggesting to employ techniques such as differential privacy and other forms of algorithmic stability to provide guarantees on the lack of infringing copying. In this work, we examine whether such algorithmic stability techniques are suitable to ensure the responsible use of generative models without inadvertently violating copyright laws. We argue that while these techniques aim to verify the presence of identifiable information in datasets, thus being privacy-oriented, copyright law aims to promote the use of original works for the benefit of society as a whole, provided that no unlicensed use of protected expression occurred. These fundamental differences between privacy and copyright must not be overlooked. In particular, we demonstrate that while algorithmic stability may be perceived as a practical tool to detect copying, such copying does not necessarily constitute copyright infringement. Therefore, if adopted as a standard for detecting an establishing copyright infringement, algorithmic stability may undermine the intended objectives of copyright law.
♻ ☆ HyMNet: a Multimodal Deep Learning System for Hypertension Classification using Fundus Photographs and Cardiometabolic Risk Factors
In recent years, deep learning has shown promise in predicting hypertension (HTN) from fundus images. However, most prior research has primarily focused on analyzing a single type of data, which may not capture the full complexity of HTN risk. To address this limitation, this study introduces a multimodal deep learning (MMDL) system, dubbed HyMNet, which combines fundus images and cardiometabolic risk factors, specifically age and gender, to improve hypertension detection capabilities. Our MMDL system uses RETFound, a foundation model pre-trained on 1.6 million retinal images, for the fundus path and a fully connected neural network for the age and gender path. The two paths are jointly trained by concatenating the feature vectors from each path that are then fed into a fusion network. The system was trained on 5,016 retinal images from 1,243 individuals collected from the Saudi Ministry of National Guard Health Affairs. The results show that the multimodal model that integrates fundus images along with age and gender outperforms the unimodal system trained solely on fundus photographs, with an F1 score of 0.771 [0.747, 0.796], and 0.745 [0.719, 0.772] for hypertension detection, respectively. Additionally, we studied the effect underlying diabetes mellitus has on the model's predictive ability, concluding that diabetes is used as a confounding variable for distinguishing hypertensive cases. Our code and model weights are publicly available at https://github.com/MohammedSB/HyMNet.
♻ ☆ Evaluating Neighbor Explainability for Graph Neural Networks
Explainability in Graph Neural Networks (GNNs) is a new field growing in the last few years. In this publication we address the problem of determining how important is each neighbor for the GNN when classifying a node and how to measure the performance for this specific task. To do this, various known explainability methods are reformulated to get the neighbor importance and four new metrics are presented. Our results show that there is almost no difference between the explanations provided by gradient-based techniques in the GNN domain. In addition, many explainability techniques failed to identify important neighbors when GNNs without self-loops are used.
♻ ☆ Dynamical softassign and adaptive parameter tuning for graph matching
This paper studies a unified framework for graph matching problems called the constrained gradient method. Popular algorithms within this framework include graduated assignment (GA), integer projected fixed-point method (IPFP), and doubly stochastic projected fixed-point method (DSPFP). These algorithms differ from the step size parameter and constrained operator. Our contributed adaptive step size parameter can guarantee the underlying algorithms' convergence and enhance their efficiency and accuracy. A preliminary analysis suggests that the optimal step size parameter has a high probability of being 1 in fully connected graph matching. Secondly, we propose a dynamic strategy for softassign, a popular constrained operator, to address its sensitivity concerning nodes' cardinality and risk of overflow. Combining the adaptive step size parameter and the dynamical softassign, we propose a novel graph matching algorithm: the softassign constrained gradient method. Various experiments demonstrate that it is significantly faster than other state-of-the-art algorithms based on the constrained gradient method with improved accuracy.
♻ ☆ Inductive Knowledge Graph Completion with GNNs and Rules: An Analysis
The task of inductive knowledge graph completion requires models to learn inference patterns from a training graph, which can then be used to make predictions on a disjoint test graph. Rule-based methods seem like a natural fit for this task, but in practice they significantly underperform state-of-the-art methods based on Graph Neural Networks (GNNs), such as NBFNet. We hypothesise that the underperformance of rule-based methods is due to two factors: (i) implausible entities are not ranked at all and (ii) only the most informative path is taken into account when determining the confidence in a given link prediction answer. To analyse the impact of these factors, we study a number of variants of a rule-based approach, which are specifically aimed at addressing the aforementioned issues. We find that the resulting models can achieve a performance which is close to that of NBFNet. Crucially, the considered variants only use a small fraction of the evidence that NBFNet relies on, which means that they largely keep the interpretability advantage of rule-based methods. Moreover, we show that a further variant, which does look at the full KG, consistently outperforms NBFNet.
♻ ☆ A Multi-Scale Decomposition MLP-Mixer for Time Series Analysis VLDB 2024
Time series data, including univariate and multivariate ones, are characterized by unique composition and complex multi-scale temporal variations. They often require special consideration of decomposition and multi-scale modeling to analyze. Existing deep learning methods on this best fit to univariate time series only, and have not sufficiently considered sub-series modeling and decomposition completeness. To address these challenges, we propose MSD-Mixer, a Multi-Scale Decomposition MLP-Mixer, which learns to explicitly decompose and represent the input time series in its different layers. To handle the multi-scale temporal patterns and multivariate dependencies, we propose a novel temporal patching approach to model the time series as multi-scale patches, and employ MLPs to capture intra- and inter-patch variations and channel-wise correlations. In addition, we propose a novel loss function to constrain both the mean and the autocorrelation of the decomposition residual for better decomposition completeness. Through extensive experiments on various real-world datasets for five common time series analysis tasks, we demonstrate that MSD-Mixer consistently and significantly outperforms other state-of-the-art algorithms with better efficiency.
comment: Accepted for VLDB 2024
♻ ☆ Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models
The popularity of pre-trained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaption cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques are proposed for language and 2D image pre-trained models. However, the specialized PEFT method for 3D pre-trained models is still under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters, and only tune the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we propose to construct a memory bank with domain-specific knowledge, and utilize a parameter-free attention to enhance the prompt tokens. The Geometry-aware Adapter aims to aggregate point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that our Point-PEFT can achieve better performance than the full fine-tuning on various downstream tasks, while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code is released at https://github.com/Ivan-Tang-3D/Point-PEFT.
comment: The specialized PEFT framework for 3D pre-trained models, which achieves competitive performance to full fine-tuning, and significantly reduces the computational resources. Project page: https://github.com/Ivan-Tang-3D/Point-PEFT
♻ ☆ A tutorial on learning from preferences and choices with Gaussian Processes
Preference modelling lies at the intersection of economics, decision theory, machine learning and statistics. By understanding individuals' preferences and how they make choices, we can build products that closely match their expectations, paving the way for more efficient and personalised applications across a wide range of domains. The objective of this tutorial is to present a cohesive and comprehensive framework for preference learning with Gaussian Processes (GPs), demonstrating how to seamlessly incorporate rationality principles (from economics and decision theory) into the learning process. By suitably tailoring the likelihood function, this framework enables the construction of preference learning models that encompass random utility models, limits of discernment, and scenarios with multiple conflicting utilities for both object- and label-preference. This tutorial builds upon established research while simultaneously introducing some novel GP-based models to address specific gaps in the existing literature.
♻ ☆ ALI-DPFL: Differentially Private Federated Learning with Adaptive Local Iterations
Federated Learning (FL) is a distributed machine learning technique that allows model training among multiple devices or organizations by sharing training parameters instead of raw data. However, adversaries can still infer individual information through inference attacks (e.g. differential attacks) on these training parameters. As a result, Differential Privacy (DP) has been widely used in FL to prevent such attacks. We consider differentially private federated learning in a resource-constrained scenario, where both privacy budget and communication rounds are constrained. By theoretically analyzing the convergence, we can find the optimal number of local DPSGD iterations for clients between any two sequential global updates. Based on this, we design an algorithm of Differentially Private Federated Learning with Adaptive Local Iterations (ALI-DPFL). We experiment our algorithm on the MNIST, FashionMNIST and Cifar10 datasets, and demonstrate significantly better performances than previous work in the resource-constraint scenario. Code is available at https://github.com/KnightWan/ALI-DPFL.
♻ ☆ Gradient-less Federated Gradient Boosting Trees with Learnable Learning Rates
The privacy-sensitive nature of decentralized datasets and the robustness of eXtreme Gradient Boosting (XGBoost) on tabular data raise the needs to train XGBoost in the context of federated learning (FL). Existing works on federated XGBoost in the horizontal setting rely on the sharing of gradients, which induce per-node level communication frequency and serious privacy concerns. To alleviate these problems, we develop an innovative framework for horizontal federated XGBoost which does not depend on the sharing of gradients and simultaneously boosts privacy and communication efficiency by making the learning rates of the aggregated tree ensembles learnable. We conduct extensive evaluations on various classification and regression datasets, showing our approach achieves performance comparable to the state-of-the-art method and effectively improves communication efficiency by lowering both communication rounds and communication overhead by factors ranging from 25x to 700x. Project Page: https://flower.ai/blog/2023-04-19-xgboost-with-flower/
comment: Accepted at the 3rd ACM Workshop on Machine Learning and Systems (EuroMLSys), May 8th 2023, Rome, Italy
♻ ☆ SynerMix: Synergistic Mixup Solution for Enhanced Intra-Class Cohesion and Inter-Class Separability in Image Classification
To address the issues of MixUp and its variants (e.g., Manifold MixUp) in image classification tasks-namely, their neglect of mixing within the same class (intra-class mixup) and their inadequacy in enhancing intra-class cohesion through their mixing operations-we propose a novel mixup method named SynerMix-Intra and, building upon this, introduce a synergistic mixup solution named SynerMix. SynerMix-Intra specifically targets intra-class mixup to bolster intra-class cohesion, a feature not addressed by current mixup methods. For each mini-batch, it leverages feature representations of unaugmented original images from each class to generate a synthesized feature representation through random linear interpolation. All synthesized representations are then fed into the classification and loss layers to calculate an average classification loss that significantly enhances intra-class cohesion. Furthermore, SynerMix combines SynerMix-Intra with an existing mixup approach (e.g., MixUp, Manifold MixUp), which primarily focuses on inter-class mixup and has the benefit of enhancing inter-class separability. In doing so, it integrates both inter- and intra-class mixup in a balanced way while concurrently improving intra-class cohesion and inter-class separability. Experimental results on six datasets show that SynerMix achieves a 0.1% to 3.43% higher accuracy than the best of either MixUp or SynerMix-Intra alone, averaging a 1.16% gain. It also surpasses the top-performer of either Manifold MixUp or SynerMix-Intra by 0.12% to 5.16%, with an average gain of 1.11%. Given that SynerMix is model-agnostic, it holds significant potential for application in other domains where mixup methods have shown promise, such as speech and text classification. Our code is publicly available at: https://github.com/wxitxy/synermix.git.
comment: 25 pages,12 figures
♻ ☆ Exponential Concentration in Stochastic Approximation
We analyze the behavior of stochastic approximation algorithms where iterates, in expectation, progress towards an objective at each step. When progress is proportional to the step size of the algorithm, we prove exponential concentration bounds. These tail-bounds contrast asymptotic normality results, which are more frequently associated with stochastic approximation. The methods that we develop rely on a geometric ergodicity proof. This extends a result on Markov chains due to Hajek (1982) to the area of stochastic approximation algorithms. We apply our results to several different Stochastic Approximation algorithms, specifically Projected Stochastic Gradient Descent, Kiefer-Wolfowitz and Stochastic Frank-Wolfe algorithms. When applicable, our results prove faster $O(1/t)$ and linear convergence rates for Projected Stochastic Gradient Descent with a non-vanishing gradient.
comment: 35 pages, 11 Figures
♻ ☆ Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood
Training energy-based models (EBMs) on high-dimensional data can be both challenging and time-consuming, and there exists a noticeable gap in sample quality between EBMs and other generative frameworks like GANs and diffusion models. To close this gap, inspired by the recent efforts of learning EBMs by maximizing diffusion recovery likelihood (DRL), we propose cooperative diffusion recovery likelihood (CDRL), an effective approach to tractably learn and sample from a series of EBMs defined on increasingly noisy versions of a dataset, paired with an initializer model for each EBM. At each noise level, the two models are jointly estimated within a cooperative training framework: samples from the initializer serve as starting points that are refined by a few MCMC sampling steps from the EBM. The EBM is then optimized by maximizing recovery likelihood, while the initializer model is optimized by learning from the difference between the refined samples and the initial samples. In addition, we made several practical designs for EBM training to further improve the sample quality. Combining these advances, our approach significantly boost the generation performance compared to existing EBM methods on CIFAR-10 and ImageNet datasets. We also demonstrate the effectiveness of our models for several downstream tasks, including classifier-free guided generation, compositional generation, image inpainting and out-of-distribution detection.
Cryptography and Security
☆ SoK: An Essential Guide For Using Malware Sandboxes In Security Applications: Challenges, Pitfalls, and Lessons Learned
Malware sandboxes provide many benefits for security applications, but they are complex. These complexities can overwhelm new users in different research areas and make it difficult to select, configure, and use sandboxes. Even worse, incorrectly using sandboxes can have a negative impact on security applications. In this paper, we address this knowledge gap by systematizing 84 representative papers for using x86/64 malware sandboxes in the academic literature. We propose a novel framework to simplify sandbox components and organize the literature to derive practical guidelines for using sandboxes. We evaluate the proposed guidelines systematically using three common security applications and demonstrate that the choice of different sandboxes can significantly impact the results. Specifically, our results show that the proposed guidelines improve the sandbox observable activities by at least 1.6x and up to 11.3x. Furthermore, we observe a roughly 25% improvement in accuracy, precision, and recall when using the guidelines to help with a malware family classification task. We conclude by affirming that there is no "silver bullet" sandbox deployment that generalizes, and we recommend that users apply our framework to define a scope for their analysis, a threat model, and derive context about how the sandbox artifacts will influence their intended use case. Finally, it is important that users document their experiment, limitations, and potential solutions for reproducibility
☆ Behind the (Digital Crime) Scenes: An MSC Model
Criminal investigations are inherently complex as they typically involve interactions among various actors like investigators, prosecutors, and defendants. The pervasive integration of technology in daily life adds an extra layer of complexity, especially in crimes that involve a digital element. The establishment of digital forensics as a foundational discipline for extracting digital evidence further exacerbates the complex nature of criminal investigations, leading to the proliferation of multiple scenarios. Recognising the need to structure standard operating procedures for the handling of digital evidence, the representation of digital forensics as a protocol emerges as a valuable opportunity to identify security and privacy threats. In this paper, we delineate the protocols that compose digital forensics within a criminal case, formalise them as message sequence charts (MSCs), and identify their functional requirements.
comment: Accepted in 12th International Symposium on Digital Forensics and Security (ISDFS 2024). 979-8-3503-3036-6/24/$31.00 copyright 2024 IEEE
☆ A Survey on Consumer IoT Traffic: Security and Privacy
For the past few years, the Consumer Internet of Things (CIoT) has entered public lives. While CIoT has improved the convenience of people's daily lives, it has also brought new security and privacy concerns. In this survey, we try to figure out what researchers can learn about the security and privacy of CIoT by traffic analysis, a popular method in the security community. From the security and privacy perspective, this survey seeks out the new characteristics in CIoT traffic analysis, the state-of-the-art progress in CIoT traffic analysis, and the challenges yet to be solved. We collected 310 papers from January 2018 to December 2023 related to CIoT traffic analysis from the security and privacy perspective and summarized the process of CIoT traffic analysis in which the new characteristics of CIoT are identified. Then, we detail existing works based on five application goals: device fingerprinting, user activity inference, malicious traffic analysis, security analysis, and measurement. At last, we discuss the new challenges and future research directions.
☆ Quantifying Arbitrage in Automated Market Makers: An Empirical Study of Ethereum ZK Rollups
Arbitrage can arise from the simultaneous purchase and sale of the same asset in different markets in order to profit from a difference in its price. This work systematically reviews arbitrage opportunities between Automated Market Makers (AMMs) on Ethereum ZK rollups, and Centralised Exchanges (CEXs). First, we propose a theoretical framework to measure such arbitrage opportunities and derive a formula for the related Maximal Arbitrage Value (MAV) that accounts for both price divergences and liquidity available in the trading venues. Then, we empirically measure the historical MAV available between SyncSwap, an AMM on zkSync Era, and Binance, and investigate how quickly misalignments in price are corrected against explicit and implicit market costs. Overall, the cumulative MAV from July to September 2023 on the USDC-ETH SyncSwap pool amounts to $104.96k (0.24% of trading volume).
comment: In proceedings
☆ Port Forwarding Services Are Forwarding Security Risks
We conduct the first comprehensive security study on representative port forwarding services (PFS), which emerge in recent years and make the web services deployed in internal networks available on the Internet along with better usability but less complexity compared to traditional techniques (e.g., NAT traversal techniques). Our study is made possible through a set of novel methodologies, which are designed to uncover the technical mechanisms of PFS, experiment attack scenarios for PFS protocols, automatically discover and snapshot port-forwarded websites (PFWs) at scale, and classify PFWs into well-observed categories. Leveraging these methodologies, we have observed the widespread adoption of PFS with millions of PFWs distributed across tens of thousands of ISPs worldwide. Furthermore, 32.31% PFWs have been classified into website categories that serve access to critical data or infrastructure, such as, web consoles for industrial control systems, IoT controllers, code repositories, and office automation systems. And 18.57% PFWs didn't enforce any access control for external visitors. Also identified are two types of attacks inherent in the protocols of Oray (one well-adopted PFS provider), and the notable abuse of PFSes by malicious actors in activities such as malware distribution, botnet operation and phishing.
☆ Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals COLING 2024
Deep neural networks (DNNs) are notoriously vulnerable to adversarial attacks that place carefully crafted perturbations on normal examples to fool DNNs. To better understand such attacks, a characterization of the features carried by adversarial examples is needed. In this paper, we tackle this challenge by inspecting the subspaces of sample features through spectral analysis. We first empirically show that the features of either clean signals or adversarial perturbations are redundant and span in low-dimensional linear subspaces respectively with minimal overlap, and the classical low-dimensional subspace projection can suppress perturbation features out of the subspace of clean signals. This makes it possible for DNNs to learn a subspace where only features of clean signals exist while those of perturbations are discarded, which can facilitate the distinction of adversarial examples. To prevent the residual perturbations that is inevitable in subspace learning, we propose an independence criterion to disentangle clean signals from perturbations. Experimental results show that the proposed strategy enables the model to inherently suppress adversaries, which not only boosts model robustness but also motivates new directions of effective adversarial defense.
comment: Accepted by COLING 2024
☆ Near-Optimal differentially private low-rank trace regression with guaranteed private initialization
We study differentially private (DP) estimation of a rank-$r$ matrix $M \in \mathbb{R}^{d_1\times d_2}$ under the trace regression model with Gaussian measurement matrices. Theoretically, the sensitivity of non-private spectral initialization is precisely characterized, and the differential-privacy-constrained minimax lower bound for estimating $M$ under the Schatten-$q$ norm is established. Methodologically, the paper introduces a computationally efficient algorithm for DP-initialization with a sample size of $n \geq \widetilde O (r^2 (d_1\vee d_2))$. Under certain regularity conditions, the DP-initialization falls within a local ball surrounding $M$. We also propose a differentially private algorithm for estimating $M$ based on Riemannian optimization (DP-RGrad), which achieves a near-optimal convergence rate with the DP-initialization and sample size of $n \geq \widetilde O(r (d_1 + d_2))$. Finally, the paper discusses the non-trivial gap between the minimax lower bound and the upper bound of low-rank matrix estimation under the trace regression model. It is shown that the estimator given by DP-RGrad attains the optimal convergence rate in a weaker notion of differential privacy. Our powerful technique for analyzing the sensitivity of initialization requires no eigengap condition between $r$ non-zero singular values.
♻ ☆ CCA-Secure Hybrid Encryption in Correlated Randomness Model and KEM Combiners
A hybrid encryption (HE) system is an efficient public key encryption system for arbitrarily long messages. An HE system consists of a public key component called key encapsulation mechanism (KEM), and a symmetric key component called data encapsulation mechanism (DEM). The HE encryption algorithm uses a KEM generated key k to encapsulate the message using DEM, and send the ciphertext together with the encapsulaton of k, to the decryptor who decapsulates k and uses it to decapsulate the message using the corresponding KEM and DEM components. The KEM/DEM composition theorem proves that if KEM and DEM satisfy well-defined security notions, then HE will be secure with well defined security. We introduce HE in correlated randomness model where the encryption and decryption algorithms have samples of correlated random variables that are partially leaked to the adversary. Security of the new KEM/DEM paradigm is defined against computationally unbounded or polynomially bounded adversaries. We define iKEM and cKEM with respective information theoretic computational security, and prove a composition theorem for them and a computationally secure DEM, resulting in secure HEs with proved computational security (CPA and CCA) and without any computational assumption. We construct two iKEMs that provably satisfy the required security notions of the composition theorem. The iKEMs are used to construct two efficient quantum-resistant HEs when used with an AES based DEM. We also define and construct combiners with proved security that combine the new KEM/DEM paradigm of HE with the traditional public key based paradigm of HE.
comment: On page 1, the extra comma (i.e. ",") in the title of the paper right after the name "Reihaneh Safavi-Naini" is removed in this revision
♻ ☆ Behavioral Authentication for Security and Safety
The issues of both system security and safety can be dissected integrally from the perspective of behavioral \emph{appropriateness}. That is, a system is secure or safe can be judged by whether the behavior of certain agent(s) is \emph{appropriate} or not. Specifically, a so-called \emph{appropriate behavior} involves the right agent performing the right actions at the right time under certain conditions. Then, according to different levels of appropriateness and degrees of custodies, behavioral authentication can be graded into three levels, i.e., the authentication of behavioral \emph{Identity}, \emph{Conformity}, and \emph{Benignity}. In a broad sense, for the security and safety issue, behavioral authentication is not only an innovative and promising method due to its inherent advantages but also a critical and fundamental problem due to the ubiquity of behavior generation and the necessity of behavior regulation in any system. By this classification, this review provides a comprehensive examination of the background and preliminaries of behavioral authentication. It further summarizes existing research based on their respective focus areas and characteristics. The challenges confronted by current behavioral authentication methods are analyzed, and potential research directions are discussed to promote the diversified and integrated development of behavioral authentication.
♻ ☆ zkTax: A pragmatic way to support zero-knowledge tax disclosures
Tax returns contain key financial information of interest to third parties: public officials are asked to share financial data for transparency, companies seek to assess the financial status of business partners, and individuals need to prove their income to landlords or to receive benefits. Tax returns also contain sensitive data such that sharing them in their entirety undermines privacy. We introduce a zero-knowledge tax disclosure system (zkTax) that allows individuals and organizations to make provable claims about select information in their tax returns without revealing additional information, which can be independently verified by third parties. The system consists of three distinct services that can be distributed: a tax authority provides tax documents signed with a public key; a Redact & Prove Service enables users to produce a redacted version of the tax documents with a zero-knowledge proof attesting the provenance of the redacted data; a Verify Service enables anyone to verify the proof. We implement a prototype with a user interface, compatible with U.S. tax forms, and demonstrate how this design could be implemented with minimal changes to existing tax infrastructure. Our system is designed to be extensible to other contexts and jurisdictions. This work provides a practical example of how distributed tools leveraging cryptography can enhance existing government or financial infrastructures, providing immediate transparency alongside privacy without system overhauls.
♻ ☆ ABC-Channel: An Advanced Blockchain-based Covert Channel
Establishing efficient and robust covert channels is crucial for secure communication within insecure network environments. With its inherent benefits of decentralization and anonymization, blockchain has gained considerable attention in developing covert channels. To guarantee a highly secure covert channel, channel negotiation should be contactless before the communication, carrier transaction features must be indistinguishable from normal transactions during the communication, and communication identities must be untraceable after the communication. Such a full-lifecycle covert channel is indispensable to defend against a versatile adversary who intercepts two communicating parties comprehensively (e.g., on-chain and off-chain). Unfortunately, it has not been thoroughly investigated in the literature. We make the first effort to achieve a full-lifecycle covert channel, a novel blockchain-based covert channel named ABC-Channel. We tackle a series of challenges, such as off-chain contact dependency, increased masquerading difficulties as growing transaction volume, and time-evolving, communicable yet untraceable identities, to achieve contactless channel negotiation, indistinguishable transaction features, and untraceable communication identities, respectively. We develop a working prototype to validate ABC-Channel and conduct extensive tests on the Bitcoin testnet. The experimental results demonstrate that ABC-Channel achieves substantially secure covert capabilities. In comparison to existing methods, it also exhibits state-of-the-art transmission efficiency.
comment: 5 pages, section 3.C; Corrected the description
♻ ☆ Can Copyright be Reduced to Privacy?
There is a growing concern that generative AI models will generate outputs closely resembling the copyrighted materials for which they are trained. This worry has intensified as the quality and complexity of generative models have immensely improved, and the availability of extensive datasets containing copyrighted material has expanded. Researchers are actively exploring strategies to mitigate the risk of generating infringing samples, with a recent line of work suggesting to employ techniques such as differential privacy and other forms of algorithmic stability to provide guarantees on the lack of infringing copying. In this work, we examine whether such algorithmic stability techniques are suitable to ensure the responsible use of generative models without inadvertently violating copyright laws. We argue that while these techniques aim to verify the presence of identifiable information in datasets, thus being privacy-oriented, copyright law aims to promote the use of original works for the benefit of society as a whole, provided that no unlicensed use of protected expression occurred. These fundamental differences between privacy and copyright must not be overlooked. In particular, we demonstrate that while algorithmic stability may be perceived as a practical tool to detect copying, such copying does not necessarily constitute copyright infringement. Therefore, if adopted as a standard for detecting an establishing copyright infringement, algorithmic stability may undermine the intended objectives of copyright law.
♻ ☆ Large Language Models for Blockchain Security: A Systematic Literature Review
Large Language Models (LLMs) have emerged as powerful tools in various domains involving blockchain security (BS). Several recent studies are exploring LLMs applied to BS. However, there remains a gap in our understanding regarding the full scope of applications, impacts, and potential constraints of LLMs on blockchain security. To fill this gap, we conduct a literature review on LLM4BS. As the first review of LLM's application on blockchain security, our study aims to comprehensively analyze existing research and elucidate how LLMs contribute to enhancing the security of blockchain systems. Through a thorough examination of scholarly works, we delve into the integration of LLMs into various aspects of blockchain security. We explore the mechanisms through which LLMs can bolster blockchain security, including their applications in smart contract auditing, identity verification, anomaly detection, vulnerable repair, and so on. Furthermore, we critically assess the challenges and limitations associated with leveraging LLMs for blockchain security, considering factors such as scalability, privacy concerns, and adversarial attacks. Our review sheds light on the opportunities and potential risks inherent in this convergence, providing valuable insights for researchers, practitioners, and policymakers alike.
♻ ☆ ALI-DPFL: Differentially Private Federated Learning with Adaptive Local Iterations
Federated Learning (FL) is a distributed machine learning technique that allows model training among multiple devices or organizations by sharing training parameters instead of raw data. However, adversaries can still infer individual information through inference attacks (e.g. differential attacks) on these training parameters. As a result, Differential Privacy (DP) has been widely used in FL to prevent such attacks. We consider differentially private federated learning in a resource-constrained scenario, where both privacy budget and communication rounds are constrained. By theoretically analyzing the convergence, we can find the optimal number of local DPSGD iterations for clients between any two sequential global updates. Based on this, we design an algorithm of Differentially Private Federated Learning with Adaptive Local Iterations (ALI-DPFL). We experiment our algorithm on the MNIST, FashionMNIST and Cifar10 datasets, and demonstrate significantly better performances than previous work in the resource-constraint scenario. Code is available at https://github.com/KnightWan/ALI-DPFL.
♻ ☆ REVERSIM: A Game-Based Environment to Study Human Aspects in Hardware Reverse Engineering
Hardware Reverse Engineering (HRE) is a technique for analyzing Integrated Circuits (ICs). Experts employ HRE for security-critical tasks, such as detecting Trojans or intellectual property violations. They rely not only on their experience and customized tools but also on their cognitive abilities. Conducting controlled experiments to assess the cognitive processes involved in HRE can open new avenues for hardware protection. However, HRE experts are largely unavailable for empirical research in real-world settings. To address this challenge, we have developed REVERSIM, a game-based environment that mimics realistic HRE subprocesses and can integrate standardized cognitive tests. REVERSIM enables quantitative studies with easier-to-recruit non-experts to uncover cognitive factors relevant to HRE, which can subsequently be validated with small expert samples. To evaluate the design of REVERSIM, the minimum requirements for successful participation, and its measurement capabilities, we conducted two studies: First, we performed semi-structured interviews with 14 professionals and researchers from the HRE domain, who attested to the comparability of REVERSIM to real-world HRE problems. Second, we conducted an online user study with 109 participants, demonstrating that they could engage in REVERSIM with low domain-specific prior knowledge. We provide refined screening criteria, derive fine-grained performance metrics, and successfully perform a cognitive test for mental speed in REVERSIM, thus contributing an important piece of the puzzle for the development of innovative hardware protection mechanisms.
Computation and Language
☆ Large Language Models in Biomedical and Health Informatics: A Bibliometric Review
Large Language Models (LLMs) have rapidly become important tools in Biomedical and Health Informatics (BHI), enabling new ways to analyze data, treat patients, and conduct research. This bibliometric review aims to provide a panoramic view of how LLMs have been used in BHI by examining research articles and collaboration networks from 2022 to 2023. It further explores how LLMs can improve Natural Language Processing (NLP) applications in various BHI areas like medical diagnosis, patient engagement, electronic health record management, and personalized medicine. To do this, our bibliometric review identifies key trends, maps out research networks, and highlights major developments in this fast-moving field. Lastly, it discusses the ethical concerns and practical challenges of using LLMs in BHI, such as data privacy and reliable medical recommendations. Looking ahead, we consider how LLMs could further transform biomedical research as well as healthcare delivery and patient outcomes. This comprehensive review serves as a resource for stakeholders in healthcare, including researchers, clinicians, and policymakers, to understand the current state and future potential of LLMs in BHI.
comment: 50 pages, 7 figures, 4 tables
☆ LexDrafter: Terminology Drafting for Legislative Documents using Retrieval Augmented Generation LREC
With the increase in legislative documents at the EU, the number of new terms and their definitions is increasing as well. As per the Joint Practical Guide of the European Parliament, the Council and the Commission, terms used in legal documents shall be consistent, and identical concepts shall be expressed without departing from their meaning in ordinary, legal, or technical language. Thus, while drafting a new legislative document, having a framework that provides insights about existing definitions and helps define new terms based on a document's context will support such harmonized legal definitions across different regulations and thus avoid ambiguities. In this paper, we present LexDrafter, a framework that assists in drafting Definitions articles for legislative documents using retrieval augmented generation (RAG) and existing term definitions present in different legislative documents. For this, definition elements are built by extracting definitions from existing documents. Using definition elements and RAG, a Definitions article can be suggested on demand for a legislative document that is being drafted. We demonstrate and evaluate the functionality of LexDrafter using a collection of EU documents from the energy domain. The code for LexDrafter framework is available at https://github.com/achouhan93/LexDrafter.
comment: Accepted at LREC-COLING 2024
☆ Connecting the Dots: Inferring Patent Phrase Similarity with Retrieved Phrase Graphs NAACL 2024
We study the patent phrase similarity inference task, which measures the semantic similarity between two patent phrases. As patent documents employ legal and highly technical language, existing semantic textual similarity methods that use localized contextual information do not perform satisfactorily in inferring patent phrase similarity. To address this, we introduce a graph-augmented approach to amplify the global contextual information of the patent phrases. For each patent phrase, we construct a phrase graph that links to its focal patents and a list of patents that are either cited by or cite these focal patents. The augmented phrase embedding is then derived from combining its localized contextual embedding with its global embedding within the phrase graph. We further propose a self-supervised learning objective that capitalizes on the retrieved topology to refine both the contextualized embedding and the graph parameters in an end-to-end manner. Experimental results from a unique patent phrase similarity dataset demonstrate that our approach significantly enhances the representation of patent phrases, resulting in marked improvements in similarity inference in a self-supervised fashion. Substantial improvements are also observed in the supervised setting, underscoring the potential benefits of leveraging retrieved phrase graph augmentation.
comment: Findings of NAACL 2024
☆ Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling LREC
Topic modelling, as a well-established unsupervised technique, has found extensive use in automatically detecting significant topics within a corpus of documents. However, classic topic modelling approaches (e.g., LDA) have certain drawbacks, such as the lack of semantic understanding and the presence of overlapping topics. In this work, we investigate the untapped potential of large language models (LLMs) as an alternative for uncovering the underlying topics within extensive text corpora. To this end, we introduce a framework that prompts LLMs to generate topics from a given set of documents and establish evaluation protocols to assess the clustering efficacy of LLMs. Our findings indicate that LLMs with appropriate prompts can stand out as a viable alternative, capable of generating relevant topic titles and adhering to human guidelines to refine and merge topics. Through in-depth experiments and evaluation, we summarise the advantages and constraints of employing LLMs in topic extraction.
comment: Accepted at LREC-COLING 2024
☆ Improving Sequence-to-Sequence Models for Abstractive Text Summarization Using Meta Heuristic Approaches
As human society transitions into the information age, reduction in our attention span is a contingency, and people who spend time reading lengthy news articles are decreasing rapidly and the need for succinct information is higher than ever before. Therefore, it is essential to provide a quick overview of important news by concisely summarizing the top news article and the most intuitive headline. When humans try to make summaries, they extract the essential information from the source and add useful phrases and grammatical annotations from the original extract. Humans have a unique ability to create abstractions. However, automatic summarization is a complicated problem to solve. The use of sequence-to-sequence (seq2seq) models for neural abstractive text summarization has been ascending as far as prevalence. Numerous innovative strategies have been proposed to develop the current seq2seq models further, permitting them to handle different issues like saliency, familiarity, and human lucidness and create excellent synopses. In this article, we aimed toward enhancing the present architectures and models for abstractive text summarization. The modifications have been aimed at fine-tuning hyper-parameters, attempting specific encoder-decoder combinations. We examined many experiments on an extensively used CNN/DailyMail dataset to check the effectiveness of various models.
☆ SQL-Encoder: Improving NL2SQL In-Context Learning Through a Context-Aware Encoder
Detecting structural similarity between queries is essential for selecting examples in in-context learning models. However, assessing structural similarity based solely on the natural language expressions of queries, without considering SQL queries, presents a significant challenge. This paper explores the significance of this similarity metric and proposes a model for accurately estimating it. To achieve this, we leverage a dataset comprising 170k question pairs, meticulously curated to train a similarity prediction model. Our comprehensive evaluation demonstrates that the proposed model adeptly captures the structural similarity between questions, as evidenced by improvements in Kendall-Tau distance and precision@k metrics. Notably, our model outperforms strong competitive embedding models from OpenAI and Cohere. Furthermore, compared to these competitive models, our proposed encoder enhances the downstream performance of NL2SQL models in 1-shot in-context learning scenarios by 1-2\% for GPT-3.5-turbo, 4-8\% for CodeLlama-7B, and 2-3\% for CodeLlama-13B.
☆ ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models NAACL-2024
Parameter-efficient fine-tuning (PEFT) is widely studied for its effectiveness and efficiency in the era of large language models. Low-rank adaptation (LoRA) has demonstrated commendable performance as a popular and representative method. However, it is implemented with a fixed intrinsic rank that might not be the ideal setting for the downstream tasks. Recognizing the need for more flexible downstream task adaptation, we extend the methodology of LoRA to an innovative approach we call allocating low-rank adaptation (ALoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process. First, we propose a novel method, AB-LoRA, that can effectively estimate the importance score of each LoRA rank. Second, guided by AB-LoRA, we gradually prune abundant and negatively impacting LoRA ranks and allocate the pruned LoRA budgets to important Transformer modules needing higher ranks. We have conducted experiments on various tasks, and the experimental results demonstrate that our ALoRA method can outperform the recent baselines with comparable tunable parameters.
comment: Accepted by NAACL-2024
☆ Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals COLING 2024
Deep neural networks (DNNs) are notoriously vulnerable to adversarial attacks that place carefully crafted perturbations on normal examples to fool DNNs. To better understand such attacks, a characterization of the features carried by adversarial examples is needed. In this paper, we tackle this challenge by inspecting the subspaces of sample features through spectral analysis. We first empirically show that the features of either clean signals or adversarial perturbations are redundant and span in low-dimensional linear subspaces respectively with minimal overlap, and the classical low-dimensional subspace projection can suppress perturbation features out of the subspace of clean signals. This makes it possible for DNNs to learn a subspace where only features of clean signals exist while those of perturbations are discarded, which can facilitate the distinction of adversarial examples. To prevent the residual perturbations that is inevitable in subspace learning, we propose an independence criterion to disentangle clean signals from perturbations. Experimental results show that the proposed strategy enables the model to inherently suppress adversaries, which not only boosts model robustness but also motivates new directions of effective adversarial defense.
comment: Accepted by COLING 2024
☆ Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions. Current methods fall short of accurately identifying and mitigating these hallucinations. To address this issue, we introduce ESREAL, a novel unsupervised learning framework designed to suppress the generation of hallucinations through accurate localization and penalization of hallucinated tokens. Initially, ESREAL creates a reconstructed image based on the generated caption and aligns its corresponding regions with those of the original image. This semantic reconstruction aids in identifying both the presence and type of token-level hallucinations within the generated caption. Subsequently, ESREAL computes token-level hallucination scores by assessing the semantic similarity of aligned regions based on the type of hallucination. Finally, ESREAL employs a proximal policy optimization algorithm, where it selectively penalizes hallucinated tokens according to their token-level hallucination scores. Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46% on the CHAIR metric. This improvement is achieved solely through signals derived from the image itself, without the need for any image-text pairs.
☆ Korean Bio-Medical Corpus (KBMC) for Medical Named Entity Recognition
Named Entity Recognition (NER) plays a pivotal role in medical Natural Language Processing (NLP). Yet, there has not been an open-source medical NER dataset specifically for the Korean language. To address this, we utilized ChatGPT to assist in constructing the KBMC (Korean Bio-Medical Corpus), which we are now presenting to the public. With the KBMC dataset, we noticed an impressive 20% increase in medical NER performance compared to models trained on general Korean NER datasets. This research underscores the significant benefits and importance of using specialized tools and datasets, like ChatGPT, to enhance language processing in specialized fields such as healthcare.
☆ What Happens to a Dataset Transformed by a Projection-based Concept Removal Method?
We investigate the behavior of methods that use linear projections to remove information about a concept from a language representation, and we consider the question of what happens to a dataset transformed by such a method. A theoretical analysis and experiments on real-world and synthetic data show that these methods inject strong statistical dependencies into the transformed datasets. After applying such a method, the representation space is highly structured: in the transformed space, an instance tends to be located near instances of the opposite label. As a consequence, the original labeling can in some cases be reconstructed by applying an anti-clustering method.
☆ A Little Leak Will Sink a Great Ship: Survey of Transparency for Large Language Models from Start to Finish
Large Language Models (LLMs) are trained on massive web-crawled corpora. This poses risks of leakage, including personal information, copyrighted texts, and benchmark datasets. Such leakage leads to undermining human trust in AI due to potential unauthorized generation of content or overestimation of performance. We establish the following three criteria concerning the leakage issues: (1) leakage rate: the proportion of leaked data in training data, (2) output rate: the ease of generating leaked data, and (3) detection rate: the detection performance of leaked versus non-leaked data. Despite the leakage rate being the origin of data leakage issues, it is not understood how it affects the output rate and detection rate. In this paper, we conduct an experimental survey to elucidate the relationship between the leakage rate and both the output rate and detection rate for personal information, copyrighted texts, and benchmark data. Additionally, we propose a self-detection approach that uses few-shot learning in which LLMs detect whether instances are present or absent in their training data, in contrast to previous methods that do not employ explicit learning. To explore the ease of generating leaked information, we create a dataset of prompts designed to elicit personal information, copyrighted text, and benchmarks from LLMs. Our experiments reveal that LLMs produce leaked information in most cases despite less such data in their training set. This indicates even small amounts of leaked data can greatly affect outputs. Our self-detection method showed superior performance compared to existing detection methods.
☆ A Survey on Lexical Ambiguity Detection and Word Sense Disambiguation SP
This paper explores techniques that focus on understanding and resolving ambiguity in language within the field of natural language processing (NLP), highlighting the complexity of linguistic phenomena such as polysemy and homonymy and their implications for computational models. Focusing extensively on Word Sense Disambiguation (WSD), it outlines diverse approaches ranging from deep learning techniques to leveraging lexical resources and knowledge graphs like WordNet. The paper introduces cutting-edge methodologies like word sense extension (WSE) and neuromyotonic approaches, enhancing disambiguation accuracy by predicting new word senses. It examines specific applications in biomedical disambiguation and language specific optimisation and discusses the significance of cognitive metaphors in discourse analysis. The research identifies persistent challenges in the field, such as the scarcity of sense annotated corpora and the complexity of informal clinical texts. It concludes by suggesting future directions, including using large language models, visual WSD, and multilingual WSD systems, emphasising the ongoing evolution in addressing lexical complexities in NLP. This thinking perspective highlights the advancement in this field to enable computers to understand language more accurately.
comment: 6 pages, 5 figures, 3 tables, Accepted by 20th IEEE International Colloquium on Signal Processing & its Applications (CSPA 2024)
☆ WangchanLion and WangchanX MRC Eval
This technical report describes the development of WangchanLion, an instruction fine-tuned model focusing on Machine Reading Comprehension (MRC) in the Thai language. Our model is based on SEA-LION and a collection of instruction following datasets. To promote open research and reproducibility, we publically release all training data, code, and the final model weights under the Apache-2 license. To assess the contextual understanding capability, we conducted extensive experimental studies using two Thai MRC datasets, XQuAD and Iapp_wiki_qa_squad. Experimental results demonstrate the model's ability to comprehend the context and produce an answer faithful to the reference one in 0-shot and 1-shot settings. In addition, our evaluation goes beyond the traditional MRC. We propose a new evaluation scheme assessing the answer's correctness, helpfulness, conciseness, and contextuality. Evaluation results provide insight into how we can improve our model in the future. Our code is public at https://github.com/vistec-AI/WangchanLion.
☆ A Multi-Label Dataset of French Fake News: Human and Machine Insights LREC
We present a corpus of 100 documents, OBSINFOX, selected from 17 sources of French press considered unreliable by expert agencies, annotated using 11 labels by 8 annotators. By collecting more labels than usual, by more annotators than is typically done, we can identify features that humans consider as characteristic of fake news, and compare them to the predictions of automated classifiers. We present a topic and genre analysis using Gate Cloud, indicative of the prevalence of satire-like text in the corpus. We then use the subjectivity analyzer VAGO, and a neural version of it, to clarify the link between ascriptions of the label Subjective and ascriptions of the label Fake News. The annotated dataset is available online at the following url: https://github.com/obs-info/obsinfox Keywords: Fake News, Multi-Labels, Subjectivity, Vagueness, Detail, Opinion, Exaggeration, French Press
comment: Paper to appear in the Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
☆ LLMs as Compiler for Arabic Programming Language
In this paper we introduce APL (Arabic Programming Language) that uses Large language models (LLM) as semi-compiler to covert Arabic text code to python code then run the code. Designing a full pipeline from the structure of the APL text then a prompt (using prompt engineering) then running the prodcued python code using PyRunner. This project has a three parts first python library, a playground with simple interface and this research paper.
☆ Argument Quality Assessment in the Age of Instruction-Following Large Language Models LREC
The computational treatment of arguments on controversial issues has been subject to extensive NLP research, due to its envisioned impact on opinion formation, decision making, writing education, and the like. A critical task in any such application is the assessment of an argument's quality - but it is also particularly challenging. In this position paper, we start from a brief survey of argument quality research, where we identify the diversity of quality notions and the subjectiveness of their perception as the main hurdles towards substantial progress on argument quality assessment. We argue that the capabilities of instruction-following large language models (LLMs) to leverage knowledge across contexts enable a much more reliable assessment. Rather than just fine-tuning LLMs towards leaderboard chasing on assessment tasks, they need to be instructed systematically with argumentation theories and scenarios as well as with ways to solve argument-related problems. We discuss the real-world opportunities and ethical issues emerging thereby.
comment: Accepted to LREC-COLING 2024
☆ Qibo: A Large Language Model for Traditional Chinese Medicine
In the field of Artificial Intelligence, Large Language Models (LLMs) have demonstrated significant advances in user intent understanding and response in a number of specialized domains, including medicine, law, and finance. However, in the unique domain of traditional Chinese medicine (TCM), the performance enhancement of LLMs is challenged by the essential differences between its theories and modern medicine, as well as the lack of specialized corpus resources. In this paper, we aim to construct and organize a professional corpus in the field of TCM, to endow the large model with professional knowledge that is characteristic of TCM theory, and to successfully develop the Qibo model based on LLaMA, which is the first LLM in the field of TCM to undergo a complete training process from pre-training to Supervised Fine-Tuning (SFT). Furthermore, we develop the Qibo-benchmark, a specialized tool for evaluating the performance of LLMs, which is a specialized tool for evaluating the performance of LLMs in the TCM domain. This tool will provide an important basis for quantifying and comparing the understanding and application capabilities of different models in the field of traditional Chinese medicine, and provide guidance for future research directions and practical applications of intelligent assistants for traditional Chinese medicine. Finally, we conducted sufficient experiments to prove that Qibo has good performance in the field of traditional Chinese medicine.
☆ Monotonic Paraphrasing Improves Generalization of Language Model Prompting
Performance of large language models (LLMs) may vary with different prompts or instructions of even the same task. One commonly recognized factor for this phenomenon is the model's familiarity with the given prompt or instruction, which is typically estimated by its perplexity. However, finding the prompt with the lowest perplexity is challenging, given the enormous space of possible prompting phrases. In this paper, we propose monotonic paraphrasing (MonoPara), an end-to-end decoding strategy that paraphrases given prompts or instructions into their lower perplexity counterparts based on an ensemble of a paraphrase LM for prompt (or instruction) rewriting, and a target LM (i.e. the prompt or instruction executor) that constrains the generation for lower perplexity. The ensemble decoding process can efficiently paraphrase the original prompt without altering its semantic meaning, while monotonically decreasing the perplexity of each generation as calculated by the target LM. We explore in detail both greedy and search-based decoding as two alternative decoding schemes of MonoPara. Notably, MonoPara does not require any training and can monotonically lower the perplexity of the paraphrased prompt or instruction, leading to improved performance of zero-shot LM prompting as evaluated on a wide selection of tasks. In addition, MonoPara is also shown to effectively improve LMs' generalization on perturbed and unseen task instructions.
☆ Node Classification via Semantic-Structural Attention-Enhanced Graph Convolutional Networks
Graph data, also known as complex network data, is omnipresent across various domains and applications. Prior graph neural network models primarily focused on extracting task-specific structural features through supervised learning objectives, but they fell short in capturing the inherent semantic and structural features of the entire graph. In this paper, we introduce the semantic-structural attention-enhanced graph convolutional network (SSA-GCN), which not only models the graph structure but also extracts generalized unsupervised features to enhance vertex classification performance. The SSA-GCN's key contributions lie in three aspects: firstly, it derives semantic information through unsupervised feature extraction from a knowledge graph perspective; secondly, it obtains structural information through unsupervised feature extraction from a complex network perspective; and finally, it integrates these features through a cross-attention mechanism. By leveraging these features, we augment the graph convolutional network, thereby enhancing the model's generalization capabilities. Our experiments on the Cora and CiteSeer datasets demonstrate the performance improvements achieved by our proposed method. Furthermore, our approach also exhibits excellent accuracy under privacy settings, making it a robust and effective solution for graph data analysis.
☆ CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering COLING 2024
The recent advancements in artificial intelligence highlight the potential of language models in psychological health support. While models trained on data from mental health service platform have achieved preliminary success, challenges persist in areas such as data scarcity, quality, and ensuring a solid foundation in psychological techniques. To address these challenges, this study introduces a novel approach to enhance the precision and efficacy of psychological support through large language models. Specifically, we design a specific prompt derived from principles of Cognitive Behavioral Therapy (CBT) and have generated the CBT QA dataset, specifically for Chinese psychological health Q&A based on CBT structured intervention strategies. Unlike previous methods, our dataset emphasizes professional and structured response. Utilizing this dataset, we fine-tuned the large language model, giving birth to CBT-LLM, the large-scale language model specifically designed for Cognitive Behavioral Therapy techniques. Empirical evaluations demonstrate that CBT-LLM excels in generating structured, professional, and highly relevant responses in psychological health support tasks, showcasing its practicality and quality. The model is available on Hugging Face: https://huggingface.co/Hongbin37/CBT-LLM.
comment: Accepted at COLING 2024
☆ BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval
The burgeoning integration of 3D medical imaging into healthcare has led to a substantial increase in the workload of medical professionals. To assist clinicians in their diagnostic processes and alleviate their workload, the development of a robust system for retrieving similar case studies presents a viable solution. While the concept holds great promise, the field of 3D medical text-image retrieval is currently limited by the absence of robust evaluation benchmarks and curated datasets. To remedy this, our study presents a groundbreaking dataset, BIMCV-R (This dataset will be released upon acceptance.), which includes an extensive collection of 8,069 3D CT volumes, encompassing over 2 million slices, paired with their respective radiological reports. Expanding upon the foundational work of our dataset, we craft a retrieval strategy, MedFinder. This approach employs a dual-stream network architecture, harnessing the potential of large language models to advance the field of medical image retrieval beyond existing text-image retrieval solutions. It marks our preliminary step towards developing a system capable of facilitating text-to-image, image-to-text, and keyword-based retrieval tasks.
♻ ☆ VQPy: An Object-Oriented Approach to Modern Video Analytics
Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., human, animals, cars, etc.), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend$\unicode{x2015}$a Python variant with constructs that make it easy for users to express video objects and their interactions$\unicode{x2015}$as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
comment: MLSys'24
♻ ☆ Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models ICLR 2024
Language models have shown promise in various tasks but can be affected by undesired data during training, fine-tuning, or alignment. For example, if some unsafe conversations are wrongly annotated as safe ones, the model fine-tuned on these samples may be harmful. Therefore, the correctness of annotations, i.e., the credibility of the dataset, is important. This study focuses on the credibility of real-world datasets, including the popular benchmarks Jigsaw Civil Comments, Anthropic Harmless & Red Team, PKU BeaverTails & SafeRLHF, that can be used for training a harmless language model. Given the cost and difficulty of cleaning these datasets by humans, we introduce a systematic framework for evaluating the credibility of datasets, identifying label errors, and evaluating the influence of noisy labels in the curated language data, specifically focusing on unsafe comments and conversation classification. With the framework, we find and fix an average of 6.16% label errors in 11 datasets constructed from the above benchmarks. The data credibility and downstream learning performance can be remarkably improved by directly fixing label errors, indicating the significance of cleaning existing real-world datasets. We provide an open-source tool, Docta, for data cleaning at https://github.com/Docta-ai/docta.
comment: ICLR 2024
♻ ☆ Unlocking the Potential of ChatGPT: A Comprehensive Exploration of its Applications, Advantages, Limitations, and Future Directions in Natural Language Processing
Large language models have revolutionized the field of artificial intelligence and have been used in various applications. Among these models, ChatGPT (Chat Generative Pre-trained Transformer) has been developed by OpenAI, it stands out as a powerful tool that has been widely adopted. ChatGPT has been successfully applied in numerous areas, including chatbots, content generation, language translation, personalized recommendations, and even medical diagnosis and treatment. Its success in these applications can be attributed to its ability to generate human-like responses, understand natural language, and adapt to different contexts. Its versatility and accuracy make it a powerful tool for natural language processing (NLP). However, there are also limitations to ChatGPT, such as its tendency to produce biased responses and its potential to perpetuate harmful language patterns. This article provides a comprehensive overview of ChatGPT, its applications, advantages, and limitations. Additionally, the paper emphasizes the importance of ethical considerations when using this robust tool in real-world scenarios. Finally, This paper contributes to ongoing discussions surrounding artificial intelligence and its impact on vision and NLP domains by providing insights into prompt engineering techniques.
♻ ☆ Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science LREC
Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, in the context of six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments investigate the impact of prompt complexity, including the effect of incorporating label definitions into the prompt; use of synonyms for label names; and the influence of integrating past memories during foundation model training. The findings indicate that in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10\%.
comment: Accepted at LREC-COLING 2024
♻ ☆ C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion ICLR 2024
In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration, which is a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT.
comment: ICLR 2024
♻ ☆ CLEX: Continuous Length Extrapolation for Large Language Models ICLR 2024
Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks, however, their exceptional capabilities are restricted within the preset context window of Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, demonstrate either notable limitations in their extrapolation abilities or sacrificing partial performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX facilitates the length extrapolation with impressive performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results reveal that CLEX can effectively extend the context window to over 4x or almost 8x training length, with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k. Our code is available at https://github.com/DAMO-NLP-SG/CLEX.
comment: ICLR 2024
♻ ☆ Examining Temporalities on Stance Detection towards COVID-19 Vaccination LREC
Previous studies have highlighted the importance of vaccination as an effective strategy to control the transmission of the COVID-19 virus. It is crucial for policymakers to have a comprehensive understanding of the public's stance towards vaccination on a large scale. However, attitudes towards COVID-19 vaccination, such as pro-vaccine or vaccine hesitancy, have evolved over time on social media. Thus, it is necessary to account for possible temporal shifts when analysing these stances. This study aims to examine the impact of temporal concept drift on stance detection towards COVID-19 vaccination on Twitter. To this end, we evaluate a range of transformer-based models using chronological (splitting the training, validation, and test sets in order of time) and random splits (randomly splitting these three sets) of social media data. Our findings reveal significant discrepancies in model performance between random and chronological splits in several existing COVID-19-related datasets; specifically, chronological splits significantly reduce the accuracy of stance classification. Therefore, real-world stance detection approaches need to be further refined to incorporate temporal factors as a key consideration.
comment: Accepted at LREC-COLING 2024
♻ ☆ HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization LREC
Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises of a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at \url{https://github.com/FloatAI/humaneval-xl}.
comment: LREC-COLING 2024
♻ ☆ Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face
We present Spacerini, a tool that integrates the Pyserini toolkit for reproducible information retrieval research with Hugging Face to enable the seamless construction and deployment of interactive search engines. Spacerini makes state-of-the-art sparse and dense retrieval models more accessible to non-IR practitioners while minimizing deployment effort. This is useful for NLP researchers who want to better understand and validate their research by performing qualitative analyses of training corpora, for IR researchers who want to demonstrate new retrieval models integrated into the growing Pyserini ecosystem, and for third parties reproducing the work of other researchers. Spacerini is open source and includes utilities for loading, preprocessing, indexing, and deploying search engines locally and remotely. We demonstrate a portfolio of 13 search engines created with Spacerini for different use cases.
♻ ☆ PWESuite: Phonetic Word Embeddings and Tasks They Facilitate LREC
Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme and cognate detection and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.
comment: LREC-COLING 2024
♻ ☆ From Graph to Word Bag: Introducing Domain Knowledge to Confusing Charge Prediction
Confusing charge prediction is a challenging task in legal AI, which involves predicting confusing charges based on fact descriptions. While existing charge prediction methods have shown impressive performance, they face significant challenges when dealing with confusing charges, such as Snatch and Robbery. In the legal domain, constituent elements play a pivotal role in distinguishing confusing charges. Constituent elements are fundamental behaviors underlying criminal punishment and have subtle distinctions among charges. In this paper, we introduce a novel From Graph to Word Bag (FWGB) approach, which introduces domain knowledge regarding constituent elements to guide the model in making judgments on confusing charges, much like a judge's reasoning process. Specifically, we first construct a legal knowledge graph containing constituent elements to help select keywords for each charge, forming a word bag. Subsequently, to guide the model's attention towards the differentiating information for each charge within the context, we expand the attention mechanism and introduce a new loss function with attention supervision through words in the word bag. We construct the confusing charges dataset from real-world judicial documents. Experiments demonstrate the effectiveness of our method, especially in maintaining exceptional performance in imbalanced label distributions.
♻ ☆ UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web
Urban region profiling from web-sourced data is of utmost importance for urban planning and sustainable development. We are witnessing a rising trend of LLMs for various fields, especially dealing with multi-modal data research such as vision-language learning, where the text modality serves as a supplement information for the image. Since textual modality has never been introduced into modality combinations in urban region profiling, we aim to answer two fundamental questions in this paper: i) Can textual modality enhance urban region profiling? ii) and if so, in what ways and with regard to which aspects? To answer the questions, we leverage the power of Large Language Models (LLMs) and introduce the first-ever LLM-enhanced framework that integrates the knowledge of textual modality into urban imagery profiling, named LLM-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP). Specifically, it first generates a detailed textual description for each satellite image by an open-source Image-to-Text LLM. Then, the model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning, jointly with contrastive loss and language modeling loss. Results on predicting three urban indicators in four major Chinese metropolises demonstrate its superior performance, with an average improvement of 6.1% on R^2 compared to the state-of-the-art methods. Our code and the image-language dataset will be released upon paper notification.
comment: Accepted by The Web Conference 2024
♻ ☆ Leveraging Large Language Models for Enhanced NLP Task Performance through Knowledge Distillation and Optimized Training Strategies
Emerging Large Language Models (LLMs) like GPT-4 have revolutionized Natural Language Processing (NLP), showing potential in traditional tasks such as Named Entity Recognition (NER). Our study explores a three-phase training strategy that harnesses GPT-4's capabilities to enhance the BERT model's performance on NER. Initially, GPT-4 annotates a subset of the CONLL2003 and additional BBC dataset without fine-tuning. We then train BERT using a mix of original and LLM-annotated data, analyzing the efficacy of LLM annotations against traditional methods. The second phase involves comparative experiments with different training regimens, assessing the synergy between distilled and original data. We observe that sequential strategies, particularly a simple mix of training first with distilled data followed by original data, significantly boost performance. In the third phase, we investigate various data blending techniques, including sigmoid and power decay functions, to optimize the training process further. Our results indicate that a strategic mix of distilled and original data markedly elevates the NER capabilities of BERT. Our approach presents a scalable methodology that reduces manual annotation costs and increases efficiency, making it especially pertinent in resource-limited and closed-network environments. The study concludes that while the 'Simple Mix' strategy yields the best results, understanding its underlying mechanisms requires further research. Future work will also focus on refining prompt designs and enhancing annotation selection processes, aiming to extend our methodology to diverse NLP tasks.
comment: 16 pages, 3 figures
♻ ☆ Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
In the rapidly advancing field of artificial intelligence, the concept of Red-Teaming or Jailbreaking large language models (LLMs) has emerged as a crucial area of study. This approach is especially significant in terms of assessing and enhancing the safety and robustness of these models. This paper investigates the intricate consequences of such modifications through model editing, uncovering a complex relationship between enhancing model accuracy and preserving its ethical integrity. Our in-depth analysis reveals a striking paradox: while injecting accurate information is crucial for model reliability, it can paradoxically destabilize the model's foundational framework, resulting in unpredictable and potentially unsafe behaviors. Additionally, we propose a benchmark dataset NicheHazardQA to investigate this unsafe behavior both within the same and cross topical domain. This aspect of our research sheds light on how the edits, impact the model's safety metrics and guardrails. Our findings show that model editing serves as a cost-effective tool for topical red-teaming by methodically applying targeted edits and evaluating the resultant model behavior.
comment: Under review. {https://huggingface.co/datasets/SoftMINER-Group/NicheHazardQA}
♻ ☆ Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph ICLR 2024
Although large language models (LLMs) have achieved significant success in various tasks, they often struggle with hallucination problems, especially in scenarios requiring deep and responsible reasoning. These issues could be partially addressed by introducing external knowledge graphs (KG) in LLM reasoning. In this paper, we propose a new LLM-KG integrating paradigm ``$\hbox{LLM}\otimes\hbox{KG}$'' which treats the LLM as an agent to interactively explore related entities and relations on KGs and perform reasoning based on the retrieved knowledge. We further implement this paradigm by introducing a new approach called Think-on-Graph (ToG), in which the LLM agent iteratively executes beam search on KG, discovers the most promising reasoning paths, and returns the most likely reasoning results. We use a number of well-designed experiments to examine and illustrate the following advantages of ToG: 1) compared with LLMs, ToG has better deep reasoning power; 2) ToG has the ability of knowledge traceability and knowledge correctability by leveraging LLMs reasoning and expert feedback; 3) ToG provides a flexible plug-and-play framework for different LLMs, KGs and prompting strategies without any additional training cost; 4) the performance of ToG with small LLM models could exceed large LLM such as GPT-4 in certain scenarios and this reduces the cost of LLM deployment and application. As a training-free method with lower computational cost and better generality, ToG achieves overall SOTA in 6 out of 9 datasets where most previous SOTAs rely on additional training.
comment: Accepted by ICLR 2024
♻ ☆ Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding LREC
Deep multimodal semantic understanding that goes beyond the mere superficial content relation mining has received increasing attention in the realm of artificial intelligence. The challenges of collecting and annotating high-quality multi-modal data have underscored the significance of few-shot learning. In this paper, we focus on two critical tasks under this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on the unified vision-language model (VLM). Specifically, we design three experts of soft prompts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representation, and a unified prompt to assist multi-modal interaction. Additionally, we reorganize Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smoothens the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in few-shot setting, our proposed model not only surpasses the 8.2B model InstructBLIP with merely 2% parameters (150M), but also significantly outperforms other widely-used prompt methods on VLMs or task-specific methods.
comment: LREC-COLING 2024, Long Paper
♻ ☆ Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic COLING 2024
Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their reasoning often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. These models sometimes show hallucinations as their reasoning procedures are unconstrained by logical principles. Aiming at improving the zero-shot chain-of-thought reasoning ability of large language models, we propose LoT (Logical Thoughts), a self-improvement prompting framework that leverages principles rooted in symbolic logic, particularly Reductio ad Absurdum, to systematically verify and rectify the reasoning processes step by step. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic. The implementation code for LoT can be accessed at: \url{https://github.com/xf-zhao/LoT}.
comment: Accepted in COLING 2024. Code see https://github.com/xf-zhao/LoT
♻ ☆ Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers
Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues of software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. Thus, its applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose DetectCodeGPT, a novel method for detecting machine-generated code, which improves DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experiment results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
comment: code available at https://github.com/YerbaPage/DetectCodeGPT
♻ ☆ Examining the Limitations of Computational Rumor Detection Models Trained on Static Datasets LREC
A crucial aspect of a rumor detection model is its ability to generalize, particularly its ability to detect emerging, previously unknown rumors. Past research has indicated that content-based (i.e., using solely source posts as input) rumor detection models tend to perform less effectively on unseen rumors. At the same time, the potential of context-based models remains largely untapped. The main contribution of this paper is in the in-depth evaluation of the performance gap between content and context-based models specifically on detecting new, unseen rumors. Our empirical findings demonstrate that context-based models are still overly dependent on the information derived from the rumors' source post and tend to overlook the significant role that contextual information can play. We also study the effect of data split strategies on classifier performance. Based on our experimental results, the paper also offers practical suggestions on how to minimize the effects of temporal concept drift in static datasets during the training of rumor detection methods.
comment: Accepted at LREC-COLING 2024
Software Engineering
☆ Coupled Requirements-driven Testing of CPS: From Simulation To Reality
Failures in safety-critical Cyber-Physical Systems (CPS), both software and hardware-related, can lead to severe incidents impacting physical infrastructure or even harming humans. As a result, extensive simulations and field tests need to be conducted, as part of the verification and validation of system requirements, to ensure system safety. However, current simulation and field testing practices, particularly in the domain of small Unmanned Aerial Systems (sUAS), are ad-hoc and lack a thorough, structured testing process. Furthermore, there is a dearth of standard processes and methodologies to inform the design of comprehensive simulation and field tests. This gap in the testing process leads to the deployment of sUAS applications that are: (a) tested in simulation environments which do not adequately capture the real-world complexity, such as environmental factors, due to a lack of tool support; (b) not subjected to a comprehensive range of scenarios during simulation testing to validate the system requirements, due to the absence of a process defining the relationship between requirements and simulation tests; and (c) not analyzed through standard safety analysis processes, because of missing traceability between simulation testing artifacts and safety analysis artifacts. To address these issues, we have developed an initial framework for validating CPS, specifically focusing on sUAS and robotic applications. We demonstrate the suitability of our framework by applying it to an example from the sUAS domain. Our preliminary results confirm the applicability of our framework. We conclude with a research roadmap to outline our next research goals along with our current proposal.
☆ "How do people decide?": A Model for Software Library Selection
Modern-day software development is often facilitated by the reuse of third-party software libraries. Despite the significant effort to understand the factors contributing to library selection, it is relatively unknown how the libraries are selected and what tools are still needed to support the selection process. Using Straussian grounded theory, we conducted and analyzed the interviews of 24 professionals across the world and derived a model of library selection process which is governed by six selection patterns (i.e., rules). The model draws from marketing theory and lays the groundwork for the development of a library selection tool which captures the technical and non-technical aspects developers consider.
☆ CoverUp: Coverage-Guided LLM-Based Test Generation
This paper presents CoverUp, a novel system that drives the generation of high-coverage Python regression tests via a combination of coverage analysis and large-language models (LLMs). CoverUp iteratively improves coverage, interleaving coverage analysis with dialogs with the LLM to focus its attention on as yet uncovered lines and branches. The resulting test suites significantly improve coverage over the current state of the art: compared to CodaMosa, a hybrid LLM / search-based software testing system, CoverUp substantially improves coverage across the board. On a per-module basis, CoverUp achieves median line coverage of 81% (vs. 62%), branch coverage of 53% (vs. 35%) and line+branch coverage of 78% (vs. 55%). We show that CoverUp's iterative, coverage-guided approach is crucial to its effectiveness, contributing to nearly half of its successes.
comment: 11 pages
☆ Can Language Models Pretend Solvers? Logic Code Simulation with LLMs
Transformer-based large language models (LLMs) have demonstrated significant potential in addressing logic problems. capitalizing on the great capabilities of LLMs for code-related activities, several frameworks leveraging logical solvers for logic reasoning have been proposed recently. While existing research predominantly focuses on viewing LLMs as natural language logic solvers or translators, their roles as logic code interpreters and executors have received limited attention. This study delves into a novel aspect, namely logic code simulation, which forces LLMs to emulate logical solvers in predicting the results of logical programs. To further investigate this novel task, we formulate our three research questions: Can LLMs efficiently simulate the outputs of logic codes? What strength arises along with logic code simulation? And what pitfalls? To address these inquiries, we curate three novel datasets tailored for the logic code simulation task and undertake thorough experiments to establish the baseline performance of LLMs in code simulation. Subsequently, we introduce a pioneering LLM-based code simulation technique, Dual Chains of Logic (DCoL). This technique advocates a dual-path thinking approach for LLMs, which has demonstrated state-of-the-art performance compared to other LLM prompt strategies, achieving a notable improvement in accuracy by 7.06% with GPT-4-Turbo.
comment: 12 pages, 8 figures
☆ LLMs as Compiler for Arabic Programming Language
In this paper we introduce APL (Arabic Programming Language) that uses Large language models (LLM) as semi-compiler to covert Arabic text code to python code then run the code. Designing a full pipeline from the structure of the APL text then a prompt (using prompt engineering) then running the prodcued python code using PyRunner. This project has a three parts first python library, a playground with simple interface and this research paper.
☆ SoK: Comprehensive Analysis of Rug Pull Causes, Datasets, and Detection Tools in DeFi
Rug pulls pose a grave threat to the cryptocurrency ecosystem, leading to substantial financial loss and undermining trust in decentralized finance (DeFi) projects. With the emergence of new rug pull patterns, research on rug pull is out of state. To fill this gap, we first conducted an extensive analysis of the literature review, encompassing both scholarly and industry sources. By examining existing academic articles and industrial discussions on rug pull projects, we present a taxonomy inclusive of 34 root causes, introducing six new categories inspired by industry sources: burn, hidden owner, ownership transfer, unverified contract, external call, and fake LP lock. Based on the developed taxonomy, we evaluated current rug pull datasets and explored the effectiveness and limitations of existing detection mechanisms. Our evaluation indicates that the existing datasets, which document 2,448 instances, address only 7 of the 34 root causes, amounting to a mere 20% coverage. It indicates that existing open-source datasets need to be improved to study rug pulls. In response, we have constructed a more comprehensive dataset containing 2,360 instances, expanding the coverage to 54% with the best effort. In addition, the examination of 14 detection tools showed that they can identify 25 of the 34 root causes, achieving a coverage of 73.5%. There are nine root causes (Fake LP Lock, Hidden Fee, and Destroy Token, Fake Money Transfer, Ownership Transfer, Liquidity Pool Block, Freeze Account, Wash-Trading, Hedge) that the existing tools cannot cover. Our work indicates that there is a significant gap between current research and detection tools, and the actual situation of rug pulls.
☆ Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications
Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that large language models (LLMs) have potential in auditing smart contracts, but the state-of-the-art indicates that even GPT-4 can achieve only 30% precision (when both decision and justification are correct). This is likely because off-the-shelf LLMs were primarily pre-trained on a general text/code corpus and not fine-tuned on the specific domain of Solidity smart contract auditing. In this paper, we propose TrustLLM, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, TrustLLM is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. As such, TrustLLM employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate causes of vulnerabilities. However, fine-tuning alone faces challenges in accurately identifying the optimal cause of a vulnerability. Therefore, we introduce two LLM-based agents, the Ranker and Critic, to iteratively select and debate the most suitable cause of vulnerability based on the output of the fine-tuned Reasoner model. To evaluate TrustLLM, we collected a balanced dataset with 1,734 positive and 1,810 negative samples to fine-tune TrustLLM. We then compared it with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) as well as prompt learning-based LLMs (GPT4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, TrustLLM achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by TrustLLM achieved a consistency of about 38% compared to the ground truth causes.
☆ FineWAVE: Fine-Grained Warning Verification of Bugs for Automated Static Analysis Tools
The continual expansion of software size and complexity has led to an increased focus on reducing defects and bugs during development. Although Automated Static Analysis Tools (ASATs) offer help, in practice, the significant number of false positives can impede developers' productivity and confidence in the tools. Therefore, previous research efforts have explored learning-based methods to validate the reported warnings. Nevertheless, there are still some limitations. (1) The granularity of prior research is coarse, as it focuses on identifying either actionable warnings throughout extensive development histories or potential true warnings at the function level. These approaches lack specificity regarding individual bugs and warnings. (2) Machine learning-based approaches need much manual effort for feature engineering while existing deep learning-based approaches ignore key semantics between source code and warnings. (3) The small number of selected projects hinders the comprehensive evaluation of these approaches. In this paper, we proposed a fine-grained warning verification approach that is sensitive to bugs for improving the results of ASATs, namely \ourtool. Specifically, we design a novel LSTM-based model that captures both fine-grained semantics of source code and warnings from ASATs and highlights their correlations with cross-attention. To tackle the data scarcity of training and evaluation, we collected a large-scale dataset of 280,273 warnings, namely FineWA. It is ten times larger than the existing largest dataset. Then, we conducted extensive experiments on the dataset to evaluate FineWAVE. The experimental results demonstrate the effectiveness of our approach, with an F1-score of 97.79% for reducing false alarms and 67.06% for confirming actual warnings, which also significantly outperforms all baselines.
☆ Fine-Grained Assertion-Based Test Selection
For large software applications, running the whole test suite after each code change is time- and resource-intensive. Regression test selection techniques aim at reducing test execution time by selecting only the tests that are affected by code changes. However, existing techniques select test entities at coarse granularity levels such as test class, which causes imprecise test selection and executing unaffected tests. We propose a novel approach that increases the selection precision by analyzing test code at statement level and treating test assertions as the unit for selection. We implement our fine-grained test selection approach in a tool called SELERTION and evaluate it by comparing against two state-of-the-art test selection techniques using 11 open-source subjects. Our results show that SELERTION increases selection precision for all the subjects. Our test selection reduces, on average, 63% of the overall test time, making regression testing up to 23% faster than the other techniques. Our results also indicate that subjects with longer test execution time benefit more by our fine-grained selection technique.
♻ ☆ HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization LREC
Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises of a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at \url{https://github.com/FloatAI/humaneval-xl}.
comment: LREC-COLING 2024
♻ ☆ Scalable and Precise Application-Centered Call Graph Construction for Python
Call graph construction is the foundation of inter-procedural static analysis. PYCG is the state-of-the-art approach for constructing call graphs for Python programs. Unfortunately, PyCG does not scale to large programs when adapted to whole-program analysis where application and dependent libraries are both analyzed. Moreover, PyCG is flow-insensitive and does not fully support Python's features, hindering its accuracy. To overcome these drawbacks, we propose a scalable and precise approach for constructing application-centered call graphs for Python programs, and implement it as a prototype tool JARVIS. JARVIS maintains a type graph (i.e., type relations of program identifiers) for each function in a program to allow type inference. Taking one function as an input, JARVIS generates the call graph on-the-fly, where flow-sensitive intra-procedural analysis and inter-procedural analysis are conducted in turn and strong updates are conducted. Our evaluation on a micro-benchmark of 135 small Python programs and a macro-benchmark of 6 real-world Python applications has demonstrated that JARVIS can significantly improve PYCG by at least 67% faster in time, 84% higher in precision, and at least 20% higher in recall.
comment: 13 pages
♻ ☆ SpecGen: Automated Generation of Formal Program Specifications via Large Language Models
Formal program specifications play a crucial role in various stages of software development. However, manually crafting formal program specifications is rather difficult, making the job time-consuming and labor-intensive. It is even more challenging to write specifications that correctly and comprehensively describe the semantics of complex programs. To reduce the burden on software developers, automated specification generation methods have emerged. However, existing methods usually rely on predefined templates or grammar, making them struggle to accurately describe the behavior and functionality of complex real-world programs. To tackle this challenge, we introduce SpecGen, a novel technique for formal program specification generation based on Large Language Models. Our key insight is to overcome the limitations of existing methods by leveraging the code comprehension capability of LLMs. The process of SpecGen consists of two phases. The first phase employs a conversational approach that guides the LLM to generate appropriate specifications for a given program. The second phase, designed for where the LLM fails to generate correct specifications, applies four mutation operators to the model-generated specifications and selects verifiable specifications from the mutated ones through a novel heuristic selection strategy. We evaluate SpecGen on two datasets, including the SV-COMP Java category benchmark and a manually constructed dataset. Experimental results demonstrate that SpecGen succeeds in generating verifiable specifications for 279 out of 385 programs, outperforming the existing purely LLM-based approaches and conventional specification generation tools like Houdini and Daikon. Further investigations on the quality of generated specifications indicate that SpecGen can comprehensively articulate the behaviors of the input program.
♻ ☆ Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers
Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues of software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. Thus, its applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose DetectCodeGPT, a novel method for detecting machine-generated code, which improves DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experiment results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
comment: code available at https://github.com/YerbaPage/DetectCodeGPT
Programming Languages
☆ CoverUp: Coverage-Guided LLM-Based Test Generation
This paper presents CoverUp, a novel system that drives the generation of high-coverage Python regression tests via a combination of coverage analysis and large-language models (LLMs). CoverUp iteratively improves coverage, interleaving coverage analysis with dialogs with the LLM to focus its attention on as yet uncovered lines and branches. The resulting test suites significantly improve coverage over the current state of the art: compared to CodaMosa, a hybrid LLM / search-based software testing system, CoverUp substantially improves coverage across the board. On a per-module basis, CoverUp achieves median line coverage of 81% (vs. 62%), branch coverage of 53% (vs. 35%) and line+branch coverage of 78% (vs. 55%). We show that CoverUp's iterative, coverage-guided approach is crucial to its effectiveness, contributing to nearly half of its successes.
comment: 11 pages
♻ ☆ HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization LREC
Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises of a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at \url{https://github.com/FloatAI/humaneval-xl}.
comment: LREC-COLING 2024
Formal Languages and Automata Theory
♻ ☆ Lemur: Integrating Large Language Models in Automated Program Verification ICLR'24
The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of derivation rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure, which led to practical improvements on a set of synthetic and competition benchmarks.
comment: Accepted at ICLR'24
♻ ☆ Simulations for Event-Clock Automata
Event-clock automata (ECA) are a well-known semantic subclass of timed automata (TA) which enjoy admirable theoretical properties, e.g., determinizability, and are practically useful to capture timed specifications. However, unlike for timed automata, there exist no implementations for checking non-emptiness of event-clock automata. As ECAs contain special prophecy clocks that guess and maintain the time to the next occurrence of specific events, they cannot be seen as a syntactic subclass of TA. Therefore, implementations for TA cannot be directly used for ECAs, and moreover the translation of an ECA to a semantically equivalent TA is expensive. Another reason for the lack of ECA implementations is the difficulty in adapting zone-based algorithms, critical in the timed automata setting, to the event-clock automata setting. This difficulty was studied by Geeraerts et al. in 2011, where the authors proposed a zone enumeration procedure that uses zone extrapolations for finiteness. In this article, we propose a different zone-based algorithm to solve the reachability problem for event-clock automata, using simulations for finiteness. A surprising consequence of our result is that for event-predicting automata, the subclass of event-clock automata that only use prophecy clocks, we obtain finiteness even without any simulations. For general event-clock automata, our new algorithm exploits the G-simulation framework, which is the coarsest known simulation relation in timed automata literature, and has been recently used for advances in other extensions of timed automata.
♻ ☆ Beatty Sequences for a Quadratic Irrational: Decidability and Applications
Let $\alpha$ and $\beta$ belong to the same quadratic field. We show that the inhomogeneous Beatty sequence $(\lfloor n \alpha + \beta \rfloor)_{n \geq 1}$ is synchronized, in the sense that there is a finite automaton that takes as input the Ostrowski representations of $n$ and $y$ in parallel, and accepts if and only if $y = \lfloor n \alpha + \beta \rfloor$. Since it is already known that the addition relation is computable for Ostrowski representations based on a quadratic number, a consequence is a new and rather simple proof that the first-order logical theory of these sequences with addition is decidable. The decision procedure is easily implemented in the free software Walnut. As an application, we show that for each $r \geq 1$ it is decidable whether the set $\{ \lfloor n \alpha + \beta \rfloor \, : \, n \geq 1 \}$ forms an additive basis (or asymptotic additive basis) of order $r$. Using our techniques, we also solve some open problems of Reble and Kimberling, and give an explicit characterization of a sequence of Hildebrand et al.
Symbolic Computation
☆ Applied Category Theory in the Wolfram Language using Categorica I: Diagrams, Functors and Fibrations
This article serves as a preliminary introduction to the design of a new, open-source applied and computational category theory framework, named Categorica, built on top of the Wolfram Language. Categorica allows one to configure and manipulate abstract quivers, categories, groupoids, diagrams, functors and natural transformations, and to perform a vast array of automated abstract algebraic computations using (arbitrary combinations of) the above structures; to manipulate and abstractly reason about arbitrary universal properties, including products, coproducts, pullbacks, pushouts, limits and colimits; and to manipulate, visualize and compute with strict (symmetric) monoidal categories, including full support for automated string diagram rewriting and diagrammatic theorem-proving. In so doing, Categorica combines the capabilities of an abstract computer algebra framework (thus allowing one to compute directly with epimorphisms, monomorphisms, retractions, sections, spans, cospans, fibrations, etc.) with those of a powerful automated theorem-proving system (thus allowing one to convert universal properties and other abstract constructions into (higher-order) equational logic statements that can be reasoned about and proved using standard automated theorem-proving methods, as well as to prove category-theoretic statements directly using purely diagrammatic methods). In this first of two articles introducing the design of the framework, we shall focus principally upon its handling of quivers, categories, diagrams, groupoids, functors and natural transformations, including demonstrations of both its algebraic manipulation and theorem-proving capabilities in each case.
comment: 71 pages, 29 figures
♻ ☆ Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic COLING 2024
Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their reasoning often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. These models sometimes show hallucinations as their reasoning procedures are unconstrained by logical principles. Aiming at improving the zero-shot chain-of-thought reasoning ability of large language models, we propose LoT (Logical Thoughts), a self-improvement prompting framework that leverages principles rooted in symbolic logic, particularly Reductio ad Absurdum, to systematically verify and rectify the reasoning processes step by step. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic. The implementation code for LoT can be accessed at: \url{https://github.com/xf-zhao/LoT}.
comment: Accepted in COLING 2024. Code see https://github.com/xf-zhao/LoT
Computer Vision and Pattern Recognition
☆ AutoInst: Automatic Instance-Based Segmentation of LiDAR 3D Scans
Recently, progress in acquisition equipment such as LiDAR sensors has enabled sensing increasingly spacious outdoor 3D environments. Making sense of such 3D acquisitions requires fine-grained scene understanding, such as constructing instance-based 3D scene segmentations. Commonly, a neural network is trained for this task; however, this requires access to a large, densely annotated dataset, which is widely known to be challenging to obtain. To address this issue, in this work we propose to predict instance segmentations for 3D scenes in an unsupervised way, without relying on ground-truth annotations. To this end, we construct a learning framework consisting of two components: (1) a pseudo-annotation scheme for generating initial unsupervised pseudo-labels; and (2) a self-training algorithm for instance segmentation to fit robust, accurate instances from initial noisy proposals. To enable generating 3D instance mask proposals, we construct a weighted proxy-graph by connecting 3D points with edges integrating multi-modal image- and point-based self-supervised features, and perform graph-cuts to isolate individual pseudo-instances. We then build on a state-of-the-art point-based architecture and train a 3D instance segmentation model, resulting in significant refinement of initial proposals. To scale to arbitrary complexity 3D scenes, we design our algorithm to operate on local 3D point chunks and construct a merging step to generate scene-level instance segmentations. Experiments on the challenging SemanticKITTI benchmark demonstrate the potential of our approach, where it attains 13.3% higher Average Precision and 9.1% higher F1 score compared to the best-performing baseline. The code will be made publicly available at https://github.com/artonson/autoinst.
comment: 9 pages, 7 figures
☆ latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction
We present latentSplat, a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture. Existing methods for generalizable 3D reconstruction either do not enable fast inference of high resolution novel views due to slow volume rendering, or are limited to interpolation of close input views, even in simpler settings with a single central object, where 360-degree generalization is possible. In this work, we combine a regression-based approach with a generative model, moving towards both of these capabilities within the same method, trained purely on readily available real video data. The core of our method are variational 3D Gaussians, a representation that efficiently encodes varying uncertainty within a latent space consisting of 3D feature Gaussians. From these Gaussians, specific instances can be sampled and rendered via efficient Gaussian splatting and a fast, generative decoder network. We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.
comment: Project website: https://geometric-rl.mpi-inf.mpg.de/latentsplat/
☆ HemoSet: The First Blood Segmentation Dataset for Automation of Hemostasis Management
Hemorrhaging occurs in surgeries of all types, forcing surgeons to quickly adapt to the visual interference that results from blood rapidly filling the surgical field. Introducing automation into the crucial surgical task of hemostasis management would offload mental and physical tasks from the surgeon and surgical assistants while simultaneously increasing the efficiency and safety of the operation. The first step in automation of hemostasis management is detection of blood in the surgical field. To propel the development of blood detection algorithms in surgeries, we present HemoSet, the first blood segmentation dataset based on bleeding during a live animal robotic surgery. Our dataset features vessel hemorrhage scenarios where turbulent flow leads to abnormal pooling geometries in surgical fields. These pools are formed in conditions endemic to surgical procedures -- uneven heterogeneous tissue, under glossy lighting conditions and rapid tool movement. We benchmark several state-of-the-art segmentation models and provide insight into the difficulties specific to blood detection. We intend for HemoSet to spur development of autonomous blood suction tools by providing a platform for training and refining blood segmentation models, addressing the precision needed for such robotics.
☆ AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue
In everyday communication, humans frequently use speech and gestures to refer to specific areas or objects, a process known as Referential Dialogue (RD). While prior studies have investigated RD through Large Language Models (LLMs) or Large Multimodal Models (LMMs) in static contexts, the exploration of Temporal Referential Dialogue (TRD) within audio-visual media remains limited. Two primary challenges hinder progress in this field: (1) the absence of comprehensive, untrimmed audio-visual video datasets with precise temporal annotations, and (2) the need for methods to integrate complex temporal auditory and visual cues effectively. To address these challenges, we introduce a novel framework to generate PU-VALOR, an extensive audio-visual dataset comprising over 114,000 untrimmed videos with accurate temporal demarcations. We also present AVicuna, featuring an Audio-Visual Tokens Interleaver (AVTI) that ensures the temporal alignment of audio-visual information. Additionally, we develop the A5-222K dataset, encompassing more than 200,000 audio-text pairings, to facilitate the audio and text alignments. Our experiments demonstrate that AVicuna can effectively handle TRD in audio-visual videos and achieve state-of-the-art performance on various audio-visual video understanding tasks, particularly in untrimmed videos. We further investigate the optimal audio-interleaving rate for interleaved audio-visual inputs, which maximizes performance on the Audio-Visual Event Dense Localization task.
☆ L-MAE: Longitudinal masked auto-encoder with time and severity-aware encoding for diabetic retinopathy progression prediction
Pre-training strategies based on self-supervised learning (SSL) have proven to be effective pretext tasks for many downstream tasks in computer vision. Due to the significant disparity between medical and natural images, the application of typical SSL is not straightforward in medical imaging. Additionally, those pretext tasks often lack context, which is critical for computer-aided clinical decision support. In this paper, we developed a longitudinal masked auto-encoder (MAE) based on the well-known Transformer-based MAE. In particular, we explored the importance of time-aware position embedding as well as disease progression-aware masking. Taking into account the time between examinations instead of just scheduling them offers the benefit of capturing temporal changes and trends. The masking strategy, for its part, evolves during follow-up to better capture pathological changes, ensuring a more accurate assessment of disease progression. Using OPHDIAT, a large follow-up screening dataset targeting diabetic retinopathy (DR), we evaluated the pre-trained weights on a longitudinal task, which is to predict the severity label of the next visit within 3 years based on the past time series examinations. Our results demonstrated the relevancy of both time-aware position embedding and masking strategies based on disease progression knowledge. Compared to popular baseline models and standard longitudinal Transformers, these simple yet effective extensions significantly enhance the predictive ability of deep classification models.
☆ Object Detectors in the Open Environment:Challenges, Solutions, and Outlook
With the emergence of foundation models, deep learning-based object detectors have shown practical usability in closed set scenarios. However, for real-world tasks, object detectors often operate in open environments, where crucial factors (\eg, data distribution, objective) that influence model learning are often changing. The dynamic and intricate nature of the open environment poses novel and formidable challenges to object detectors. Unfortunately, current research on object detectors in open environments lacks a comprehensive analysis of their distinctive characteristics, challenges, and corresponding solutions, which hinders their secure deployment in critical real-world scenarios. This paper aims to bridge this gap by conducting a comprehensive review and analysis of object detectors in open environments. We initially identified limitations of key structural components within the existing detection pipeline and propose the open environment object detector challenge framework that includes four quadrants (\ie, out-of-domain, out-of-category, robust learning, and incremental learning) based on the dimensions of the data / target changes. For each quadrant of challenges in the proposed framework, we present a detailed description and systematic analysis of the overarching goals and core difficulties, systematically review the corresponding solutions, and benchmark their performance over multiple widely adopted datasets. In addition, we engage in a discussion of open problems and potential avenues for future research. This paper aims to provide a fresh, comprehensive, and systematic understanding of the challenges and solutions associated with open-environment object detectors, thus catalyzing the development of more solid applications in real-world scenarios.
comment: 32 pages, 17 figures
☆ Constricting Normal Latent Space for Anomaly Detection with Normal-only Training Data ICLR
In order to devise an anomaly detection model using only normal training data, an autoencoder (AE) is typically trained to reconstruct the data. As a result, the AE can extract normal representations in its latent space. During test time, since AE is not trained using real anomalies, it is expected to poorly reconstruct the anomalous data. However, several researchers have observed that it is not the case. In this work, we propose to limit the reconstruction capability of AE by introducing a novel latent constriction loss, which is added to the existing reconstruction loss. By using our method, no extra computational cost is added to the AE during test time. Evaluations using three video anomaly detection benchmark datasets, i.e., Ped2, Avenue, and ShanghaiTech, demonstrate the effectiveness of our method in limiting the reconstruction capability of AE, which leads to a better anomaly detection model.
comment: ICLR Workshop 2024 (PML4LRS)
☆ Emotion Recognition from the perspective of Activity Recognition
Applications of an efficient emotion recognition system can be found in several domains such as medicine, driver fatigue surveillance, social robotics, and human-computer interaction. Appraising human emotional states, behaviors, and reactions displayed in real-world settings can be accomplished using latent continuous dimensions. Continuous dimensional models of human affect, such as those based on valence and arousal are more accurate in describing a broad range of spontaneous everyday emotions than more traditional models of discrete stereotypical emotion categories (e.g. happiness, surprise). Most of the prior work on estimating valence and arousal considers laboratory settings and acted data. But, for emotion recognition systems to be deployed and integrated into real-world mobile and computing devices, we need to consider data collected in the world. Action recognition is a domain of Computer Vision that involves capturing complementary information on appearance from still frames and motion between frames. In this paper, we treat emotion recognition from the perspective of action recognition by exploring the application of deep learning architectures specifically designed for action recognition, for continuous affect recognition. We propose a novel three-stream end-to-end deep learning regression pipeline with an attention mechanism, which is an ensemble design based on sub-modules of multiple state-of-the-art action recognition systems. The pipeline constitutes a novel data pre-processing approach with a spatial self-attention mechanism to extract keyframes. The optical flow of high-attention regions of the face is extracted to capture temporal context. AFEW-VA in-the-wild dataset has been used to conduct comparative experiments. Quantitative analysis shows that the proposed model outperforms multiple standard baselines of both emotion recognition and action recognition models.
☆ Out-of-Distribution Detection via Deep Multi-Comprehension Ensemble
Recent research underscores the pivotal role of the Out-of-Distribution (OOD) feature representation field scale in determining the efficacy of models in OOD detection. Consequently, the adoption of model ensembles has emerged as a prominent strategy to augment this feature representation field, capitalizing on anticipated model diversity. However, our introduction of novel qualitative and quantitative model ensemble evaluation methods, specifically Loss Basin/Barrier Visualization and the Self-Coupling Index, reveals a critical drawback in existing ensemble methods. We find that these methods incorporate weights that are affine-transformable, exhibiting limited variability and thus failing to achieve the desired diversity in feature representation. To address this limitation, we elevate the dimensions of traditional model ensembles, incorporating various factors such as different weight initializations, data holdout, etc., into distinct supervision tasks. This innovative approach, termed Multi-Comprehension (MC) Ensemble, leverages diverse training tasks to generate distinct comprehensions of the data and labels, thereby extending the feature representation field. Our experimental results demonstrate the superior performance of the MC Ensemble strategy in OOD detection compared to both the naive Deep Ensemble method and a standalone model of comparable size. This underscores the effectiveness of our proposed approach in enhancing the model's capability to detect instances outside its training distribution.
☆ Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis CVPR2024
While replacing Gaussian decoders with a conditional diffusion model enhances the perceptual quality of reconstructions in neural image compression, their lack of inductive bias for image data restricts their ability to achieve state-of-the-art perceptual levels. To address this limitation, we adopt a non-isotropic diffusion model at the decoder side. This model imposes an inductive bias aimed at distinguishing between frequency contents, thereby facilitating the generation of high-quality images. Moreover, our framework is equipped with a novel entropy model that accurately models the probability distribution of latent representation by exploiting spatio-channel correlations in latent space, while accelerating the entropy decoding step. This channel-wise entropy model leverages both local and global spatial contexts within each channel chunk. The global spatial context is built upon the Transformer, which is specifically designed for image compression tasks. The designed Transformer employs a Laplacian-shaped positional encoding, the learnable parameters of which are adaptively adjusted for each channel cluster. Our experiments demonstrate that our proposed framework yields better perceptual quality compared to cutting-edge generative-based codecs, and the proposed entropy model contributes to notable bitrate savings.
comment: Accepted by CVPR2024
☆ Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features using the complementary strengths of various data modalities. However, the open nature of such systems inadvertently increases the possibility of backdoor attacks. These attacks subtly embed malicious behaviors within the model during training, which can be activated by specific triggers in the inference phase, posing significant security risks. Despite existing countermeasures through fine-tuning that reduce the adverse impacts of such attacks, these defenses often degrade the clean accuracy and necessitate the construction of extensive clean training pairs. In this paper, we explore the possibility of a less-cost defense from the perspective of model unlearning, that is, whether the model can be made to quickly \textbf{u}nlearn \textbf{b}ackdoor \textbf{t}hreats (UBT) by constructing a small set of poisoned samples. Specifically, we strengthen the backdoor shortcuts to discover suspicious samples through overfitting training prioritized by weak similarity samples. Building on the initial identification of suspicious samples, we introduce an innovative token-based localized forgetting training regime. This technique specifically targets the poisoned aspects of the model, applying a focused effort to unlearn the backdoor associations and trying not to damage the integrity of the overall model. Experimental results show that our method not only ensures a minimal success rate for attacks, but also preserves the model's high clean accuracy.
comment: 6 pages, 2 figures
☆ Partially Blinded Unlearning: Class Unlearning for Deep Networks a Bayesian Perspective
In order to adhere to regulatory standards governing individual data privacy and safety, machine learning models must systematically eliminate information derived from specific subsets of a user's training data that can no longer be utilized. The emerging discipline of Machine Unlearning has arisen as a pivotal area of research, facilitating the process of selectively discarding information designated to specific sets or classes of data from a pre-trained model, thereby eliminating the necessity for extensive retraining from scratch. The principal aim of this study is to formulate a methodology tailored for the purposeful elimination of information linked to a specific class of data from a pre-trained classification network. This intentional removal is crafted to degrade the model's performance specifically concerning the unlearned data class while concurrently minimizing any detrimental impacts on the model's performance in other classes. To achieve this goal, we frame the class unlearning problem from a Bayesian perspective, which yields a loss function that minimizes the log-likelihood associated with the unlearned data with a stability regularization in parameter space. This stability regularization incorporates Mohalanobis distance with respect to the Fisher Information matrix and $l_2$ distance from the pre-trained model parameters. Our novel approach, termed \textbf{Partially-Blinded Unlearning (PBU)}, surpasses existing state-of-the-art class unlearning methods, demonstrating superior effectiveness. Notably, PBU achieves this efficacy without requiring awareness of the entire training dataset but only to the unlearned data points, marking a distinctive feature of its performance.
☆ On the Equivalency, Substitutability, and Flexibility of Synthetic Data
We study, from an empirical standpoint, the efficacy of synthetic data in real-world scenarios. Leveraging synthetic data for training perception models has become a key strategy embraced by the community due to its efficiency, scalability, perfect annotations, and low costs. Despite proven advantages, few studies put their stress on how to efficiently generate synthetic datasets to solve real-world problems and to what extent synthetic data can reduce the effort for real-world data collection. To answer the questions, we systematically investigate several interesting properties of synthetic data -- the equivalency of synthetic data to real-world data, the substitutability of synthetic data for real data, and the flexibility of synthetic data generators to close up domain gaps. Leveraging the M3Act synthetic data generator, we conduct experiments on DanceTrack and MOT17. Our results suggest that synthetic data not only enhances model performance but also demonstrates substitutability for real data, with 60% to 80% replacement without performance loss. In addition, our study of the impact of synthetic data distributions on downstream performance reveals the importance of flexible data generators in narrowing domain gaps for improved model adaptability.
☆ Adversarially Masked Video Consistency for Unsupervised Domain Adaptation
We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first module is called Generative Adversarial Domain Alignment Network with the aim of learning domain-invariant representations. It simultaneously learns a mask generator and a domain-invariant encoder in an adversarial way. The domain-invariant encoder is trained to minimize the distance between the source and target domain. The masking generator, conversely, aims at producing challenging masks by maximizing the domain distance. The second is a Masked Consistency Learning module to learn class-discriminative representations. It enforces the prediction consistency between the masked target videos and their full forms. To better evaluate the effectiveness of domain adaptation methods, we construct a more challenging benchmark for egocentric videos, U-Ego4D. Our method achieves state-of-the-art performance on the Epic-Kitchen and the proposed U-Ego4D benchmark.
☆ Low Rank Groupwise Deformations for Motion Tracking in Cardiac Cine MRI
Diffeomorphic image registration is a commonly used method to deform one image to resemble another. While warping a single image to another is useful, it can be advantageous to warp multiple images simultaneously, such as in tracking the motion of the heart across a sequence of images. In this paper, our objective is to propose a novel method capable of registering a group or sequence of images to a target image, resulting in registered images that appear identical and therefore have a low rank. Moreover, we aim for these registered images to closely resemble the target image. Through experimental evidence, we will demonstrate our method's superior efficacy in producing low-rank groupwise deformations compared to other state-of-the-art approaches.
comment: A thesis submitted to the University of Birmingham for MSc Degree
☆ Dual-modal Prior Semantic Guided Infrared and Visible Image Fusion for Intelligent Transportation System
Infrared and visible image fusion (IVF) plays an important role in intelligent transportation system (ITS). The early works predominantly focus on boosting the visual appeal of the fused result, and only several recent approaches have tried to combine the high-level vision task with IVF. However, they prioritize the design of cascaded structure to seek unified suitable features and fit different tasks. Thus, they tend to typically bias toward to reconstructing raw pixels without considering the significance of semantic features. Therefore, we propose a novel prior semantic guided image fusion method based on the dual-modality strategy, improving the performance of IVF in ITS. Specifically, to explore the independent significant semantic of each modality, we first design two parallel semantic segmentation branches with a refined feature adaptive-modulation (RFaM) mechanism. RFaM can perceive the features that are semantically distinct enough in each semantic segmentation branch. Then, two pilot experiments based on the two branches are conducted to capture the significant prior semantic of two images, which then is applied to guide the fusion task in the integration of semantic segmentation branches and fusion branches. In addition, to aggregate both high-level semantics and impressive visual effects, we further investigate the frequency response of the prior semantics, and propose a multi-level representation-adaptive fusion (MRaF) module to explicitly integrate the low-frequent prior semantic with the high-frequent details. Extensive experiments on two public datasets demonstrate the superiority of our method over the state-of-the-art image fusion approaches, in terms of either the visual appeal or the high-level semantics.
☆ Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields CVPR 2024
Inverse rendering aims at recovering both geometry and materials of objects. It provides a more compatible reconstruction for conventional rendering engines, compared with the neural radiance fields (NeRFs). On the other hand, existing NeRF-based inverse rendering methods cannot handle glossy objects with local light interactions well, as they typically oversimplify the illumination as a 2D environmental map, which assumes infinite lights only. Observing the superiority of NeRFs in recovering radiance fields, we propose a novel 5D Neural Plenoptic Function (NeP) based on NeRFs and ray tracing, such that more accurate lighting-object interactions can be formulated via the rendering equation. We also design a material-aware cone sampling strategy to efficiently integrate lights inside the BRDF lobes with the help of pre-filtered radiance fields. Our method has two stages: the geometry of the target object and the pre-filtered environmental radiance fields are reconstructed in the first stage, and materials of the target object are estimated in the second stage with the proposed NeP and material-aware cone sampling strategy. Extensive experiments on the proposed real-world and synthetic datasets demonstrate that our method can reconstruct high-fidelity geometry/materials of challenging glossy objects with complex lighting interactions from nearby objects. Project webpage: https://whyy.site/paper/nep
comment: CVPR 2024 paper. Project webpage https://whyy.site/paper/nep
☆ Exemplar-Free Class Incremental Learning via Incremental Representation
Exemplar-Free Class Incremental Learning (efCIL) aims to continuously incorporate the knowledge from new classes while retaining previously learned information, without storing any old-class exemplars (i.e., samples). For this purpose, various efCIL methods have been proposed over the past few years, generally with elaborately constructed old pseudo-features, increasing the difficulty of model development and interpretation. In contrast, we propose a \textbf{simple Incremental Representation (IR) framework} for efCIL without constructing old pseudo-features. IR utilizes dataset augmentation to cover a suitable feature space and prevents the model from forgetting by using a single L2 space maintenance loss. We discard the transient classifier trained on each one of the sequence tasks and instead replace it with a 1-near-neighbor classifier for inference, ensuring the representation is incrementally updated during CIL. Extensive experiments demonstrate that our proposed IR achieves comparable performance while significantly preventing the model from forgetting on CIFAR100, TinyImageNet, and ImageNetSubset datasets.
☆ Leveraging Deep Learning and Xception Architecture for High-Accuracy MRI Classification in Alzheimer Diagnosis
Exploring the application of deep learning technologies in the field of medical diagnostics, Magnetic Resonance Imaging (MRI) provides a unique perspective for observing and diagnosing complex neurodegenerative diseases such as Alzheimer Disease (AD). With advancements in deep learning, particularly in Convolutional Neural Networks (CNNs) and the Xception network architecture, we are now able to analyze and classify vast amounts of MRI data with unprecedented accuracy. The progress of this technology not only enhances our understanding of brain structural changes but also opens up new avenues for monitoring disease progression through non-invasive means and potentially allows for precise diagnosis in the early stages of the disease. This study aims to classify MRI images using deep learning models to identify different stages of Alzheimer Disease through a series of innovative data processing and model construction steps. Our experimental results show that the deep learning framework based on the Xception model achieved a 99.6% accuracy rate in the multi-class MRI image classification task, demonstrating its potential application value in assistive diagnosis. Future research will focus on expanding the dataset, improving model interpretability, and clinical validation to further promote the application of deep learning technology in the medical field, with the hope of bringing earlier diagnosis and more personalized treatment plans to Alzheimer Disease patients.
☆ Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane
We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in one single tri-plane tensor, from which multiple Singed Distance Function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses tri-planes into a latent space, and then the denoising diffusion process is employed to approximate the distribution of the compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement in the room or avatar cloth re-targeting.
comment: Video: https://youtu.be/lRn-HqyCrLI
☆ Image Captioning in news report scenario
Image captioning strives to generate pertinent captions for specified images, situating itself at the crossroads of Computer Vision (CV) and Natural Language Processing (NLP). This endeavor is of paramount importance with far-reaching applications in recommendation systems, news outlets, social media, and beyond. Particularly within the realm of news reporting, captions are expected to encompass detailed information, such as the identities of celebrities captured in the images. However, much of the existing body of work primarily centers around understanding scenes and actions. In this paper, we explore the realm of image captioning specifically tailored for celebrity photographs, illustrating its broad potential for enhancing news industry practices. This exploration aims to augment automated news content generation, thereby facilitating a more nuanced dissemination of information. Our endeavor shows a broader horizon, enriching the narrative in news reporting through a more intuitive image captioning framework.
comment: 10 pages, 4 figures
☆ Skull-to-Face: Anatomy-Guided 3D Facial Reconstruction and Editing
Deducing the 3D face from a skull is an essential but challenging task in forensic science and archaeology. Existing methods for automated facial reconstruction yield inaccurate results, suffering from the non-determinative nature of the problem that a skull with a sparse set of tissue depth cannot fully determine the skinned face. Additionally, their texture-less results require further post-processing stages to achieve a photo-realistic appearance. This paper proposes an end-to-end 3D face reconstruction and exploration tool, providing textured 3D faces for reference. With the help of state-of-the-art text-to-image diffusion models and image-based facial reconstruction techniques, we generate an initial reference 3D face, whose biological profile aligns with the given skull. We then adapt these initial faces to meet the statistical expectations of extruded anatomical landmarks on the skull through an optimization process. The joint statistical distribution of tissue depths is learned on a small set of anatomical landmarks on the skull. To support further adjustment, we propose an efficient face adaptation tool to assist users in tuning tissue depths, either globally or at local regions, while observing plausible visual feedback. Experiments conducted on a real skull-face dataset demonstrated the effectiveness of our proposed pipeline in terms of reconstruction accuracy, diversity, and stability.
☆ Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains CVPR 2024
This paper presents an innovative framework designed to train an image deblurring algorithm tailored to a specific camera device. This algorithm works by transforming a blurry input image, which is challenging to deblur, into another blurry image that is more amenable to deblurring. The transformation process, from one blurry state to another, leverages unpaired data consisting of sharp and blurry images captured by the target camera device. Learning this blur-to-blur transformation is inherently simpler than direct blur-to-sharp conversion, as it primarily involves modifying blur patterns rather than the intricate task of reconstructing fine image details. The efficacy of the proposed approach has been demonstrated through comprehensive experiments on various benchmarks, where it significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Our code and data are available at https://zero1778.github.io/blur2blur/
comment: Accepted to CVPR 2024
☆ FH-SSTNet: Forehead Creases based User Verification using Spatio-Spatial Temporal Network
Biometric authentication, which utilizes contactless features, such as forehead patterns, has become increasingly important for identity verification and access management. The proposed method is based on learning a 3D spatio-spatial temporal convolution to create detailed pictures of forehead patterns. We introduce a new CNN model called the Forehead Spatio-Spatial Temporal Network (FH-SSTNet), which utilizes a 3D CNN architecture with triplet loss to capture distinguishing features. We enhance the model's discrimination capability using Arcloss in the network's head. Experimentation on the Forehead Creases version 1 (FH-V1) dataset, containing 247 unique subjects, demonstrates the superior performance of FH-SSTNet compared to existing methods and pre-trained CNNs like ResNet50, especially for forehead-based user verification. The results demonstrate the superior performance of FH-SSTNet for forehead-based user verification, confirming its effectiveness in identity authentication.
comment: 6 pages, 5 Figure, IWBF conference
☆ From Discrete to Continuous: Deep Fair Clustering With Transferable Representations
We consider the problem of deep fair clustering, which partitions data into clusters via the representations extracted by deep neural networks while hiding sensitive data attributes. To achieve fairness, existing methods present a variety of fairness-related objective functions based on the group fairness criterion. However, these works typically assume that the sensitive attributes are discrete and do not work for continuous sensitive variables, such as the proportion of the female population in an area. Besides, the potential of the representations learned from clustering tasks to improve performance on other tasks is ignored by existing works. In light of these limitations, we propose a flexible deep fair clustering method that can handle discrete and continuous sensitive attributes simultaneously. Specifically, we design an information bottleneck style objective function to learn fair and clustering-friendly representations. Furthermore, we explore for the first time the transferability of the extracted representations to other downstream tasks. Unlike existing works, we impose fairness at the representation level, which could guarantee fairness for the transferred task regardless of clustering results. To verify the effectiveness of the proposed method, we perform extensive experiments on datasets with discrete and continuous sensitive attributes, demonstrating the advantage of our method in comparison with state-of-the-art methods.
☆ Diffusion Model is a Good Pose Estimator from 3D RF-Vision
Human pose estimation (HPE) from Radio Frequency vision (RF-vision) performs human sensing using RF signals that penetrate obstacles without revealing privacy (e.g., facial information). Recently, mmWave radar has emerged as a promising RF-vision sensor, providing radar point clouds by processing RF signals. However, the mmWave radar has a limited resolution with severe noise, leading to inaccurate and inconsistent human pose estimation. This work proposes mmDiff, a novel diffusion-based pose estimator tailored for noisy radar data. Our approach aims to provide reliable guidance as conditions to diffusion models. Two key challenges are addressed by mmDiff: (1) miss-detection of parts of human bodies, which is addressed by a module that isolates feature extraction from different body parts, and (2) signal inconsistency due to environmental interference, which is tackled by incorporating prior knowledge of body structure and motion. Several modules are designed to achieve these goals, whose features work as the conditions for the subsequent diffusion model, eliminating the miss-detection and instability of HPE based on RF-vision. Extensive experiments demonstrate that mmDiff outperforms existing methods significantly, achieving state-of-the-art performances on public datasets.
♻ ☆ Ghost on the Shell: An Expressive Representation of General 3D Shapes ICLR 2024
The creation of photorealistic virtual worlds requires the accurate modeling of 3D surface geometry for a wide range of objects. For this, meshes are appealing since they 1) enable fast physics-based rendering with realistic material and lighting, 2) support physical simulation, and 3) are memory-efficient for modern graphics pipelines. Recent work on reconstructing and statistically modeling 3D shape, however, has critiqued meshes as being topologically inflexible. To capture a wide range of object shapes, any 3D representation must be able to model solid, watertight, shapes as well as thin, open, surfaces. Recent work has focused on the former, and methods for reconstructing open surfaces do not support fast reconstruction with material and lighting or unconditional generative modelling. Inspired by the observation that open surfaces can be seen as islands floating on watertight surfaces, we parameterize open surfaces by defining a manifold signed distance field on watertight templates. With this parameterization, we further develop a grid-based and differentiable representation that parameterizes both watertight and non-watertight meshes of arbitrary topology. Our new representation, called Ghost-on-the-Shell (G-Shell), enables two important applications: differentiable rasterization-based reconstruction from multiview images and generative modelling of non-watertight meshes. We empirically demonstrate that G-Shell achieves state-of-the-art performance on non-watertight mesh reconstruction and generation tasks, while also performing effectively for watertight meshes.
comment: ICLR 2024 Oral (v3: 30 pages, 19 figures, Project Page: https://gshell3d.github.io/)
♻ ☆ VQPy: An Object-Oriented Approach to Modern Video Analytics
Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., human, animals, cars, etc.), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend$\unicode{x2015}$a Python variant with constructs that make it easy for users to express video objects and their interactions$\unicode{x2015}$as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
comment: MLSys'24
♻ ☆ Latent Dataset Distillation with Diffusion Models
The efficacy of machine learning has traditionally relied on the availability of increasingly larger datasets. However, large datasets pose storage challenges and contain non-influential samples, which could be ignored during training without impacting the final accuracy of the model. In response to these limitations, the concept of distilling the information on a dataset into a condensed set of (synthetic) samples, namely a distilled dataset, emerged. One crucial aspect is the selected architecture (usually ConvNet) for linking the original and synthetic datasets. However, the final accuracy is lower if the employed model architecture differs from the model used during distillation. Another challenge is the generation of high-resolution images, e.g., 128x128 and higher. In this paper, we propose Latent Dataset Distillation with Diffusion Models (LD3M) that combine diffusion in latent space with dataset distillation to tackle both challenges. LD3M incorporates a novel diffusion process tailored for dataset distillation, which improves the gradient norms for learning synthetic images. By adjusting the number of diffusion steps, LD3M also offers a straightforward way of controlling the trade-off between speed and accuracy. We evaluate our approach in several ImageNet subsets and for high-resolution images (128x128 and 256x256). As a result, LD3M consistently outperforms state-of-the-art distillation techniques by up to 4.8 p.p. and 4.2 p.p. for 1 and 10 images per class, respectively.
♻ ☆ BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling
Recent efforts in using 3D Gaussians for scene reconstruction and novel view synthesis can achieve impressive results on curated benchmarks; however, images captured in real life are often blurry. In this work, we analyze the robustness of Gaussian-Splatting-based methods against various image blur, such as motion blur, defocus blur, downscaling blur, \etc. Under these degradations, Gaussian-Splatting-based methods tend to overfit and produce worse results than Neural-Radiance-Field-based methods. To address this issue, we propose Blur Agnostic Gaussian Splatting (BAGS). BAGS introduces additional 2D modeling capacities such that a 3D-consistent and high quality scene can be reconstructed despite image-wise blur. Specifically, we model blur by estimating per-pixel convolution kernels from a Blur Proposal Network (BPN). BPN is designed to consider spatial, color, and depth variations of the scene to maximize modeling capacity. Additionally, BPN also proposes a quality-assessing mask, which indicates regions where blur occur. Finally, we introduce a coarse-to-fine kernel optimization scheme; this optimization scheme is fast and avoids sub-optimal solutions due to a sparse point cloud initialization, which often occurs when we apply Structure-from-Motion on blurry images. We demonstrate that BAGS achieves photorealistic renderings under various challenging blur conditions and imaging geometry, while significantly improving upon existing approaches.
♻ ☆ Detection of diabetic retinopathy using longitudinal self-supervised learning MICCAI
Longitudinal imaging is able to capture both static anatomical structures and dynamic changes in disease progression towards earlier and better patient-specific pathology management. However, conventional approaches for detecting diabetic retinopathy (DR) rarely take advantage of longitudinal information to improve DR analysis. In this work, we investigate the benefit of exploiting self-supervised learning with a longitudinal nature for DR diagnosis purposes. We compare different longitudinal self-supervised learning (LSSL) methods to model the disease progression from longitudinal retinal color fundus photographs (CFP) to detect early DR severity changes using a pair of consecutive exams. The experiments were conducted on a longitudinal DR screening dataset with or without those trained encoders (LSSL) acting as a longitudinal pretext task. Results achieve an AUC of 0.875 for the baseline (model trained from scratch) and an AUC of 0.96 (95% CI: 0.9593-0.9655 DeLong test) with a p-value < 2.2e-16 on early fusion using a simple ResNet alike architecture with frozen LSSL weights, suggesting that the LSSL latent space enables to encode the dynamic of DR progression.
comment: Accepted preprint for presentation at MICCAI-OMIA
♻ ☆ Influencer Backdoor Attack on Semantic Segmentation
When a small number of poisoned samples are injected into the training dataset of a deep neural network, the network can be induced to exhibit malicious behavior during inferences, which poses potential threats to real-world applications. While they have been intensively studied in classification, backdoor attacks on semantic segmentation have been largely overlooked. Unlike classification, semantic segmentation aims to classify every pixel within a given image. In this work, we explore backdoor attacks on segmentation models to misclassify all pixels of a victim class by injecting a specific trigger on non-victim pixels during inferences, which is dubbed Influencer Backdoor Attack (IBA). IBA is expected to maintain the classification accuracy of non-victim pixels and mislead classifications of all victim pixels in every single inference and could be easily applied to real-world scenes. Based on the context aggregation ability of segmentation models, we proposed a simple, yet effective, Nearest-Neighbor trigger injection strategy. We also introduce an innovative Pixel Random Labeling strategy which maintains optimal performance even when the trigger is placed far from the victim pixels. Our extensive experiments reveal that current segmentation models do suffer from backdoor attacks, demonstrate IBA real-world applicability, and show that our proposed techniques can further increase attack performance.
♻ ☆ DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization CVPR 2024
Radiance fields have demonstrated impressive performance in synthesizing novel views from sparse input views, yet prevailing methods suffer from high training costs and slow inference speed. This paper introduces DNGaussian, a depth-regularized framework based on 3D Gaussian radiance fields, offering real-time and high-quality few-shot novel view synthesis at low costs. Our motivation stems from the highly efficient representation and surprising quality of the recent 3D Gaussian Splatting, despite it will encounter a geometry degradation when input views decrease. In the Gaussian radiance fields, we find this degradation in scene geometry primarily lined to the positioning of Gaussian primitives and can be mitigated by depth constraint. Consequently, we propose a Hard and Soft Depth Regularization to restore accurate scene geometry under coarse monocular depth supervision while maintaining a fine-grained color appearance. To further refine detailed geometry reshaping, we introduce Global-Local Depth Normalization, enhancing the focus on small local depth changes. Extensive experiments on LLFF, DTU, and Blender datasets demonstrate that DNGaussian outperforms state-of-the-art methods, achieving comparable or better results with significantly reduced memory cost, a $25 \times$ reduction in training time, and over $3000 \times$ faster rendering speed.
comment: Accepted at CVPR 2024. Project page: https://fictionarry.github.io/DNGaussian/
♻ ☆ DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching CVPR 2024
Matching 2D keypoints in an image to a sparse 3D point cloud of the scene without requiring visual descriptors has garnered increased interest due to its low memory requirements, inherent privacy preservation, and reduced need for expensive 3D model maintenance compared to visual descriptor-based methods. However, existing algorithms often compromise on performance, resulting in a significant deterioration compared to their descriptor-based counterparts. In this paper, we introduce DGC-GNN, a novel algorithm that employs a global-to-local Graph Neural Network (GNN) that progressively exploits geometric and color cues to represent keypoints, thereby improving matching accuracy. Our procedure encodes both Euclidean and angular relations at a coarse level, forming the geometric embedding to guide the point matching. We evaluate DGC-GNN on both indoor and outdoor datasets, demonstrating that it not only doubles the accuracy of the state-of-the-art visual descriptor-free algorithm but also substantially narrows the performance gap between descriptor-based and descriptor-free methods.
comment: CVPR 2024
♻ ☆ DemoCaricature: Democratising Caricature Generation with a Rough Sketch
In this paper, we democratise caricature generation, empowering individuals to effortlessly craft personalised caricatures with just a photo and a conceptual sketch. Our objective is to strike a delicate balance between abstraction and identity, while preserving the creativity and subjectivity inherent in a sketch. To achieve this, we present Explicit Rank-1 Model Editing alongside single-image personalisation, selectively applying nuanced edits to cross-attention layers for a seamless merge of identity and style. Additionally, we propose Random Mask Reconstruction to enhance robustness, directing the model to focus on distinctive identity and style features. Crucially, our aim is not to replace artists but to eliminate accessibility barriers, allowing enthusiasts to engage in the artistry.
♻ ☆ SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs ICRA 2024
Object rearrangement is pivotal in robotic-environment interactions, representing a significant capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that utilizes a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot exemplifies lightweight, real-time, and user-controllable characteristics, seamlessly blending the consideration of commonsense knowledge with automatic generation capabilities. SG-Bot employs a three-fold procedure--observation, imagination, and execution--to adeptly address the task. Initially, objects are discerned and extracted from a cluttered scene during the observation. These objects are first coarsely organized and depicted within a scene graph, guided by either commonsense or user-defined criteria. Then, this scene graph subsequently informs a generative model, which forms a fine-grained goal scene considering the shape information from the initial scene and object semantics. Finally, for execution, the initial and envisioned goal scenes are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.
comment: ICRA 2024 accepted. Project website: https://sites.google.com/view/sg-bot
♻ ☆ C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion ICLR 2024
In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration, which is a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT.
comment: ICLR 2024
♻ ☆ UCM-Net: A Lightweight and Efficient Solution for Skin Lesion Segmentation using MLP and CNN
Skin cancer is a significant public health problem, and computer-aided diagnosis can help to prevent and treat it. A crucial step for computer-aided diagnosis is accurately segmenting skin lesions in images, which allows for lesion detection, classification, and analysis. However, this task is challenging due to the diverse characteristics of lesions, such as appearance, shape, size, color, texture, and location, as well as image quality issues like noise, artifacts, and occlusions. Deep learning models have recently been applied to skin lesion segmentation, but they have high parameter counts and computational demands, making them unsuitable for mobile health applications. To address this challenge, we propose UCM-Net, a novel, efficient, and lightweight solution that integrates Multi-Layer Perceptions (MLP) and Convolutional Neural Networks (CNN). Unlike conventional UNet architectures, our UCMNet-Block reduces parameter overhead and enhances UCM-Net's learning capabilities, leading to robust segmentation performance. We validate UCM-Net's competitiveness through extensive experiments on PH2, isic2017 and isic2018 datasets. Remarkably, UCM-Net has less than 50KB parameters and less than 0.05 Giga-Operations Per Second (GLOPs), setting a new possible standard for efficiency in skin lesion segmentation. The source code will be publicly available.
comment: 17 pages, under review
♻ ☆ CEIMVEN: An Approach of Cutting Edge Implementation of Modified Versions of EfficientNet (V1-V2) Architecture for Breast Cancer Detection and Classification from Ultrasound Images
Undoubtedly breast cancer identifies itself as one of the most widespread and terrifying cancers across the globe. Millions of women are getting affected each year from it. Breast cancer remains the major one for being the reason of largest number of demise of women. In the recent time of research, Medical Image Computing and Processing has been playing a significant role for detecting and classifying breast cancers from ultrasound images and mammograms, along with the celestial touch of deep neural networks. In this research, we focused mostly on our rigorous implementations and iterative result analysis of different cutting-edge modified versions of EfficientNet architectures namely EfficientNet-V1 (b0-b7) and EfficientNet-V2 (b0-b3) with ultrasound image, named as CEIMVEN. We utilized transfer learning approach here for using the pre-trained models of EfficientNet versions. We activated the hyper-parameter tuning procedures, added fully connected layers, discarded the unprecedented outliers and recorded the accuracy results from our custom modified EfficientNet architectures. Our deep learning model training approach was related to both identifying the cancer affected areas with region of interest (ROI) techniques and multiple classifications (benign, malignant and normal). The approximate testing accuracies we got from the modified versions of EfficientNet-V1 (b0- 99.15%, b1- 98.58%, b2- 98.43%, b3- 98.01%, b4- 98.86%, b5- 97.72%, b6- 97.72%, b7- 98.72%) and EfficientNet-V2 (b0- 99.29%, b1- 99.01%, b2- 98.72%, b3- 99.43%) are showing very bright future and strong potentials of deep learning approach for the successful detection and classification of breast cancers from the ultrasound images at a very early stage. The code for this research is available here: https://github.com/ac005sheekar/CEIMVEN-Cutting-Edge-Implementation-of-Modified-EfficientNet-V1-V2-for-BreastCancer-Detection.
Information Retrieval
☆ Ultra Low-Cost Two-Stage Multimodal System for Non-Normative Behavior Detection
The online community has increasingly been inundated by a toxic wave of harmful comments. In response to this growing challenge, we introduce a two-stage ultra-low-cost multimodal harmful behavior detection method designed to identify harmful comments and images with high precision and recall rates. We first utilize the CLIP-ViT model to transform tweets and images into embeddings, effectively capturing the intricate interplay of semantic meaning and subtle contextual clues within texts and images. Then in the second stage, the system feeds these embeddings into a conventional machine learning classifier like SVM or logistic regression, enabling the system to be trained rapidly and to perform inference at an ultra-low cost. By converting tweets into rich multimodal embeddings through the CLIP-ViT model and utilizing them to train conventional machine learning classifiers, our system is not only capable of detecting harmful textual information with near-perfect performance, achieving precision and recall rates above 99\% but also demonstrates the ability to zero-shot harmful images without additional training, thanks to its multimodal embedding input. This capability empowers our system to identify unseen harmful images without requiring extensive and costly image datasets. Additionally, our system quickly adapts to new harmful content; if a new harmful content pattern is identified, we can fine-tune the classifier with the corresponding tweets' embeddings to promptly update the system. This makes it well suited to addressing the ever-evolving nature of online harmfulness, providing online communities with a robust, generalizable, and cost-effective tool to safeguard their communities.
comment: to be appear in International Workshop on Coordination, Organizations, Institutions, Norms and Ethics for Governance of Multi-Agent Systems
☆ Complementary Recommendation in E-commerce: Definition, Approaches, and Future Directions
In recent years, complementary recommendation has received extensive attention in the e-commerce domain. In this paper, we comprehensively summarize and compare 34 representative studies conducted between 2009 and 2024. Firstly, we compare the data and methods used for modeling complementary relationships between products, including simple complementarity and more complex scenarios such as asymmetric complementarity, the coexistence of substitution and complementarity relationships between products, and varying degrees of complementarity between different pairs of products. Next, we classify and compare the models based on the research problems of complementary recommendation, such as diversity, personalization, and cold-start. Furthermore, we provide a comparative analysis of experimental results from different studies conducted on the same dataset, which helps identify the strengths and weaknesses of the research. Compared to previous surveys, this paper provides a more updated and comprehensive summary of the research, discusses future research directions, and contributes to the advancement of this field.
comment: 20 pages,9 figures
☆ RankingSHAP -- Listwise Feature Attribution Explanations for Ranking Models
Feature attributions are a commonly used explanation type, when we want to posthoc explain the prediction of a trained model. Yet, they are not very well explored in IR. Importantly, feature attribution has rarely been rigorously defined, beyond attributing the most important feature the highest value. What it means for a feature to be more important than others is often left vague. Consequently, most approaches focus on just selecting the most important features and under utilize or even ignore the relative importance within features. In this work, we rigorously define the notion of feature attribution for ranking models, and list essential properties that a valid attribution should have. We then propose RankingSHAP as a concrete instantiation of a list-wise ranking attribution method. Contrary to current explanation evaluation schemes that focus on selections, we propose two novel evaluation paradigms for evaluating attributions over learning-to-rank models. We evaluate RankingSHAP for commonly used learning-to-rank datasets to showcase the more nuanced use of an attribution method while highlighting the limitations of selection-based explanations. In a simulated experiment we design an interpretable model to demonstrate how list-wise ranking attributes can be used to investigate model decisions and evaluate the explanations qualitatively. Because of the contrastive nature of the ranking task, our understanding of ranking model decisions can substantially benefit from feature attribution explanations like RankingSHAP.
☆ Knowledge-aware Dual-side Attribute-enhanced Recommendation
\textit{Knowledge-aware} recommendation methods (KGR) based on \textit{graph neural networks} (GNNs) and \textit{contrastive learning} (CL) have achieved promising performance. However, they fall short in modeling fine-grained user preferences and further fail to leverage the \textit{preference-attribute connection} to make predictions, leading to sub-optimal performance. To address the issue, we propose a method named \textit{\textbf{K}nowledge-aware \textbf{D}ual-side \textbf{A}ttribute-enhanced \textbf{R}ecommendation} (KDAR). Specifically, we build \textit{user preference representations} and \textit{attribute fusion representations} upon the attribute information in knowledge graphs, which are utilized to enhance \textit{collaborative filtering} (CF) based user and item representations, respectively. To discriminate the contribution of each attribute in these two types of attribute-based representations, a \textit{multi-level collaborative alignment contrasting} mechanism is proposed to align the importance of attributes with CF signals. Experimental results on four benchmark datasets demonstrate the superiority of KDAR over several state-of-the-art baselines. Further analyses verify the effectiveness of our method. The code of KDAR is released at: \href{https://github.com/TJTP/KDAR}{https://github.com/TJTP/KDAR}.
♻ ☆ Our Model Achieves Excellent Performance on MovieLens: What Does it Mean?
A typical benchmark dataset for recommender system (RecSys) evaluation consists of user-item interactions generated on a platform within a time period. The interaction generation mechanism partially explains why a user interacts with (e.g., like, purchase, rate) an item, and the context of when a particular interaction happened. In this study, we conduct a meticulous analysis of the MovieLens dataset and explain the potential impact of using the dataset for evaluating recommendation algorithms. We make a few main findings from our analysis. First, there are significant differences in user interactions at the different stages when a user interacts with the MovieLens platform. The early interactions largely define the user portrait which affects the subsequent interactions. Second, user interactions are highly affected by the candidate movies that are recommended by the platform's internal recommendation algorithm(s). Third, changing the order of user interactions makes it more difficult for sequential algorithms to capture the progressive interaction process. We further discuss the discrepancy between the interaction generation mechanism that is employed by the MovieLens system and that of typical real-world recommendation scenarios. In summary, the MovieLens platform demonstrates an efficient and effective way of collecting user preferences to address cold-starts. However, models that achieve excellent recommendation accuracy on the MovieLens dataset may not demonstrate superior performance in practice, for at least two kinds of differences: (i) the differences in the contexts of user-item interaction generation, and (ii) the differences in user knowledge about the item collections. While results on MovieLens can be useful as a reference, they should not be solely relied upon as the primary justification for the effectiveness of a recommendation system model.
♻ ☆ Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face
We present Spacerini, a tool that integrates the Pyserini toolkit for reproducible information retrieval research with Hugging Face to enable the seamless construction and deployment of interactive search engines. Spacerini makes state-of-the-art sparse and dense retrieval models more accessible to non-IR practitioners while minimizing deployment effort. This is useful for NLP researchers who want to better understand and validate their research by performing qualitative analyses of training corpora, for IR researchers who want to demonstrate new retrieval models integrated into the growing Pyserini ecosystem, and for third parties reproducing the work of other researchers. Spacerini is open source and includes utilities for loading, preprocessing, indexing, and deploying search engines locally and remotely. We demonstrate a portfolio of 13 search engines created with Spacerini for different use cases.
Data Structures and Algorithms
☆ Optimization on a Finer Scale: Bounded Local Subgradient Variation Perspective
We initiate the study of nonsmooth optimization problems under bounded local subgradient variation, which postulates bounded difference between (sub)gradients in small local regions around points, in either average or maximum sense. The resulting class of objective functions encapsulates the classes of objective functions traditionally studied in optimization, which are defined based on either Lipschitz continuity of the objective or H\"{o}lder/Lipschitz continuity of its gradient. Further, the defined class contains functions that are neither Lipschitz continuous nor have a H\"{o}lder continuous gradient. When restricted to the traditional classes of optimization problems, the parameters defining the studied classes lead to more fine-grained complexity bounds, recovering traditional oracle complexity bounds in the worst case but generally leading to lower oracle complexity for functions that are not ``worst case.'' Some highlights of our results are that: (i) it is possible to obtain complexity results for both convex and nonconvex problems with the (local or global) Lipschitz constant being replaced by a constant of local subgradient variation and (ii) mean width of the subdifferential set around the optima plays a role in the complexity of nonsmooth optimization, particularly in parallel settings. A consequence of (ii) is that for any error parameter $\epsilon > 0$, parallel oracle complexity of nonsmooth Lipschitz convex optimization is lower than its sequential oracle complexity by a factor $\tilde{\Omega}\big(\frac{1}{\epsilon}\big)$ whenever the objective function is piecewise linear with polynomially many pieces in the input size. This is particularly surprising as existing parallel complexity lower bounds are based on such classes of functions. The seeming contradiction is resolved by considering the region in which the algorithm is allowed to query the objective.
☆ A Novel exact algorithm for economic lot-sizing with piecewise linear production costs
In this paper, we study the single-item economic lot-sizing problem with production cost functions that are piecewise linear. The lot-sizing problem stands as a foundational cornerstone within the domain of lot-sizing problems. It is also applicable to a variety of important production planning problems which are special cases to it according to \cite{ou}. The problem becomes intractable when $m$, the number of different breakpoints of the production-cost function is variable as the problem was proven NP-hard by \cite{Florian1980}. For a fixed $m$ an $O(T^{2m+3})$ time algorithm was given by \cite{Koca2014} which was subsequently improved to $O(T^{m+2}\log(T))$ time by \cite{ou} where $T$ is the number of periods in the planning horizon.\newline We introduce a more efficient $O(T^{m+2})$ time algorithm for this problem which improves upon the previous state-of-the-art algorithm by Ou and which is derived using several novel algorithmic techniques that may be of independent interest.
comment: 10 pages
☆ Hashing geographical point data using the space-filling H-curve
We construct geohashing procedure based on using of space-filling H-curve. This curve provides a way to construct geohash with less computations than the construction based on usage of Hilbert curve. At the same time, H-curve has better clustering properties.
☆ Maximum Polygon Packing: The CG:SHOP Challenge 2024
We give an overview of the 2024 Computational Geometry Challenge targeting the problem \textsc{Maximum Polygon Packing}: Given a convex region $P$ in the plane, and a collection of simple polygons $Q_1, \ldots, Q_n$, each $Q_i$ with a respective value $c_i$, find a subset $S \subseteq \{1, \ldots,n\}$ and a feasible packing within $P$ of the polygons $Q_i$ (without rotation) for $i \in S$, maximizing $\sum_{i \in S} c_i$. Geometric packing problems, such as this, present significant computational challenges and are of substantial practical importance.
comment: 16 pages, 10 figures
☆ Convolution and Knapsack in Higher Dimensions
In the Knapsack problem, one is given the task of packing a knapsack of a given size with items in order to gain a packing with a high profit value. In recent years, a connection to the $(\max,+)$-convolution problem has been established, where knapsack solutions can be combined by building the convolution of two sequences. This observation has been used to give conditional lower bounds but also parameterized algorithms. In this paper we want to carry these results into higher dimension. We consider Knapsack where items are characterized by multiple properties - given through a vector - and a knapsack that has a capacity vector. The packing must now not exceed any of the given capacity constraints. In order to show a similar sub-quadratic lower bound we introduce a multi-dimensional version of convolution as well. Instead of combining sequences, we will generalize this problem and combine higher dimensional matrices. We will establish a few variants of these problems and prove that they are all equivalent in terms of algorithms that allow for a running time sub-quadratic in the number of entries of the matrix. We further develop a parameterized algorithm to solve higher dimensional Knapsack. The techniques we apply are inspired by an algorithm introduced by Axiotis and Tzamos. In general, we manage not only to extend their result to higher dimension. We will show that even for higher dimensional Knapsack, we can reduce the problem to convolution on one-dimensional sequences, leading to an $\mathcal{O}(d(n + D \cdot \max\{\Pi_{i=1}^d{t_i}, t_{\max}\log t_{\max}\} ))$ algorithm, where $D$ is the number of different weight vectors, $t$ the capacity vector and $d$ is the dimension of the problem. Finally we also modify this algorithm to handle items with negative weights to cross the bridge from solving not only Knapsack but also Integer Linear Programs (ILPs) in general.
♻ ☆ Enhancing Grover's Search Algorithm: A Modified Approach to Increase the Probability of Good States
This article introduces an enhancement to the Grover search algorithm to speed up computing the probability of finding good states. It suggests incorporating a rotation phase angle determined mathematically from the derivative of the model during the initial iteration. At each iteration, a new phase angle is computed and used in a rotation gate around y+z axis in the diffusion operator. The computed phase angles are optimized through an adaptive adjustment based on the estimated increasing ratio of the consecutive amplitudes. The findings indicate an average decrease of 28% in the required number of iterations resulting in a faster overall process and fewer number of quantum gates. For large search space, this improvement rises to 29.58%. Given the computational capabilities of the computer utilized for the simulation, the approach is applied to instances with up to 12 qubits or 4096 possible combination of search entries.
♻ ☆ Gradient-less Federated Gradient Boosting Trees with Learnable Learning Rates
The privacy-sensitive nature of decentralized datasets and the robustness of eXtreme Gradient Boosting (XGBoost) on tabular data raise the needs to train XGBoost in the context of federated learning (FL). Existing works on federated XGBoost in the horizontal setting rely on the sharing of gradients, which induce per-node level communication frequency and serious privacy concerns. To alleviate these problems, we develop an innovative framework for horizontal federated XGBoost which does not depend on the sharing of gradients and simultaneously boosts privacy and communication efficiency by making the learning rates of the aggregated tree ensembles learnable. We conduct extensive evaluations on various classification and regression datasets, showing our approach achieves performance comparable to the state-of-the-art method and effectively improves communication efficiency by lowering both communication rounds and communication overhead by factors ranging from 25x to 700x. Project Page: https://flower.ai/blog/2023-04-19-xgboost-with-flower/
comment: Accepted at the 3rd ACM Workshop on Machine Learning and Systems (EuroMLSys), May 8th 2023, Rome, Italy
♻ ☆ Best of Both Worlds Guarantees for Smoothed Online Quadratic Optimization
We study the smoothed online quadratic optimization (SOQO) problem where, at each round $t$, a player plays an action $x_t$ in response to a quadratic hitting cost and an additional squared $\ell_2$-norm cost for switching actions. This problem class has strong connections to a wide range of application domains including smart grid management, adaptive control, and data center management, where switching-efficient algorithms are highly sought after. We study the SOQO problem in both adversarial and stochastic settings, and in this process, perform the first stochastic analysis of this class of problems. We provide the online optimal algorithm when the minimizers of the hitting cost function evolve as a general stochastic process, which, for the case of martingale process, takes the form of a distribution-agnostic dynamic interpolation algorithm (LAI). Next, we present the stochastic-adversarial trade-off by proving an $\Omega(T)$ expected regret for the adversarial optimal algorithm in the literature (ROBD) with respect to LAI and, a sub-optimal competitive ratio for LAI in the adversarial setting. Finally, we present a best-of-both-worlds algorithm that obtains a robust adversarial performance while simultaneously achieving a near-optimal stochastic performance.
comment: 48 pages, 9 figures
Signal Processing
☆ Modeling Analog Dynamic Range Compressors using Deep Learning and State-space Models
We describe a novel approach for developing realistic digital models of dynamic range compressors for digital audio production by analyzing their analog prototypes. While realistic digital dynamic compressors are potentially useful for many applications, the design process is challenging because the compressors operate nonlinearly over long time scales. Our approach is based on the structured state space sequence model (S4), as implementing the state-space model (SSM) has proven to be efficient at learning long-range dependencies and is promising for modeling dynamic range compressors. We present in this paper a deep learning model with S4 layers to model the Teletronix LA-2A analog dynamic range compressor. The model is causal, executes efficiently in real time, and achieves roughly the same quality as previous deep-learning models but with fewer parameters.
☆ Site-Specific Beam Alignment in 6G via Deep Learning
Beam alignment (BA) in modern millimeter wave standards such as 5G NR and WiGig (802.11ay) is based on exhaustive and/or hierarchical beam searches over pre-defined codebooks of wide and narrow beams. This approach is slow and bandwidth/power-intensive, and is a considerable hindrance to the wide deployment of millimeter wave bands. A new approach is needed as we move towards 6G. BA is a promising use case for deep learning (DL) in the 6G air interface, offering the possibility of automated custom tuning of the BA procedure for each cell based on its unique propagation environment and user equipment (UE) location patterns. We overview and advocate for such an approach in this paper, which we term site-specific beam alignment (SSBA). SSBA largely eliminates wasteful searches and allows UEs to be found much more quickly and reliably, without many of the drawbacks of other machine learning-aided approaches. We first overview and demonstrate new results on SSBA, then identify the key open challenges facing SSBA.
comment: Accepted for publication in the IEEE Communications Magazine
☆ Fusion of Active and Passive Measurements for Robust and Scalable Positioning
This paper addresses the challenge of achieving reliable and robust positioning of a mobile agent, such as a radio device carried by a person, in scenarios where direct line-of-sight (LOS) links are obstructed or unavailable. The human body is considered as an extended object that scatters, attenuates and blocks the radio signals. We propose a novel particle-based sum-product algorithm (SPA) that fuses active measurements between the agent and anchors with passive measurements from pairs of anchors reflected off the body. We first formulate radio signal models for both active and passive measurements. Then, a joint tracking algorithm that utilizes both active and passive measurements is developed for the extended object. The algorithm exploits the probabilistic data association (PDA) for multiple object-related measurements. The results demonstrate superior accuracy during and after the obstructed line-of-sight (OLOS) situation, outperforming conventional methods that solely rely on active measurements. The proposed joint estimation approach significantly enhances the localization robustness via radio sensing.
☆ On the Secrecy Enhancement of an Integrated Ground-Aerial Network with a Hybrid FSO/THz Feeder Link
High altitude platforms (HAPs)-aided terrestrial-aerial communication technology based on free-space optical (FSO) and Terahertz (THz) feeder links has been attracting notable interest recently due to its great potential in reaching a higher data rate and connectivity. Nonetheless, the presence of harsh vertical propagation environments and potential aerial eavesdroppers are two of the main challenges limiting the reliability and security of such a technology. In this work, a secrecy-enhancing scheme for HAP-aided ground-aerial communication is proposed. The considered network consists of HAP-assisted communication between a ground station and a legitimate user under the threat of an aerial and ground eavesdropper. Thus, the proposed scheme leverages (i) HAP diversity by exploiting the presence of multiple flying HAPs and (ii) the use of a hybrid FSO/THz transmission scheme to offer better resilience against eavesdropping attacks. An analytical secrecy outage probability (SOP) expression is derived for the scheme in consideration. Results manifest the notable gain in security of the proposed scheme with respect to both (i) the single-HAP and (ii) THz feeder-based benchmark ones, where the proposed scheme's SOP is decreased by four orders of magnitude using $4$ HAPs with respect to the first benchmark scheme, while a $5$-dB secrecy gain is manifested with respect to the second benchmark one.
comment: 14 pages, 9 figures
☆ Holography inspired self-controlled reconfigurable intelligent surface
Among various promising candidate technologies for the sixth-generation (6G) wireless communications, recent advances in microwave metasurfaces have sparked a new research area of reconfigurable intelligent surfaces (RISs). By controllably reprogramming the wireless propagation channel, RISs are envisioned to achieve low-cost wireless capacity boosting, coverage extension, and enhanced energy efficiency. To reprogram the channel, each meta-atom on RIS needs an external control signal, which is usually generated by base station (BS). However, BS-controlled RISs require complicated control cables, which hamper their massive deployments. Here, we eliminate the need for BS control by proposing a self-controlled RIS (SC-RIS), which is inspired by the optical holography principle. Different from the existing BS-controlled RISs, each meta-atom of SC-RIS is integrated with an additional power detector for holographic recording. By applying the classical Fourier-transform processing to the measured hologram, SC-RIS is capable of retrieving the user's channel state information required for beamforming, thus enabling autonomous RIS beamforming without control cables. Owing to this WiFi-like plug-and-play capability without the BS control, SC-RISs are expected to enable easy and massive deployments in the future 6G systems.
comment: Traditional BS-controlled RISs suffer from complicated control cables. To "cut" the control cables, we propose a self-controlled RIS by leveraging the holographic interference principle, thus realizing autonomous RIS beamforming
♻ ☆ Assessing cognitive function among older adults using machine learning and wearable device data: a feasibility study
Timely implementation of interventions to slow cognitive decline among older adults requires accurate monitoring to detect changes in cognitive function. Data gathered using wearable devices that can continuously monitor factors known to be associated with cognition could be used to train machine learning models and develop wearable-based cognitive monitoring systems. Using data from over 2,400 older adults in the National Health and Nutrition Examination Survey (NHANES) we developed prediction models to differentiate older adults with normal cognition from those with poor cognition based on outcomes from three cognitive tests measuring different domains of cognitive function. During repeated cross-validation, CatBoost, XGBoost, and Random Forest models performed best when predicting cognition based on processing speed, working memory, and attention (median AUCs >0.82) compared to immediate and delayed recall (median AUCs >0.72) and categorical verbal fluency (median AUC >0.68). Activity and sleep parameters were also more strongly associated with processing speed, working memory, and attention compared to other cognitive subdomains. Our work provides proof of concept that wearable-based cognitive monitoring systems may be a viable alternative to traditional methods for monitoring processing speeds, working memory, and attention. We further identified novel metrics that could be targets in future causal studies seeking to better understand how sleep and activity parameters influence cognitive function among older adults.
♻ ☆ Iterative execution of discrete and inverse discrete Fourier transforms with applications for signal denoising via sparsification
We describe a family of iterative algorithms that involve the repeated execution of discrete and inverse discrete Fourier transforms. One interesting member of this family is motivated by the discrete Fourier transform uncertainty principle and involves the application of a sparsification operation to both the real domain and frequency domain data with convergence obtained when real domain sparsity hits a stable pattern. This sparsification variant has practical utility for signal denoising, in particular the recovery of a periodic spike signal in the presence of Gaussian noise. General convergence properties and denoising performance relative to existing methods are demonstrated using simulation studies. An R package implementing this technique and related resources can be found at https://hrfrost.host.dartmouth.edu/IterativeFT.
♻ ☆ Modulation and Estimation with a Helper
The problem of transmitting a parameter value over an additive white Gaussian noise (AWGN) channel is considered, where, in addition to the transmitter and the receiver, there is a helper that observes the noise non-causally and provides a description of limited rate $R_h$ to the transmitter and/or the receiver. We derive upper and lower bounds on the optimal achievable $\alpha$-th moment of the estimation error and show that they coincide for small values of $\alpha$ and for high values of $R_h$. The upper bound relies on a recently proposed channel-coding scheme that effectively conveys $R_h$ bits essentially error-free and the rest of the rate - over the same AWGN channel without help, with the error-free bits being allocated to the most significant bits of the quantized parameter. We then concentrate on the setting with a total transmit energy constraint, for which we derive achievability results for both channel coding and parameter modulation for several scenarios: when the helper assists only the transmitter or only the receiver and knows the noise, and when the helper assists the transmitter and/or the receiver and knows both the noise and the message. In particular, for the message-informed helper that assists both the receiver and the transmitter, it is shown that the error probability in the channel-coding task decays doubly exponentially. Finally, we translate these results to those for continuous-time power-limited AWGN channels with unconstrained bandwidth. As a byproduct, we show that the capacity with a message-informed helper that is available only at the transmitter can exceed the sum of the capacity without help and the help rate $R_h$, when the helper knows only the noise but not the message.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Cooperative Sense and Avoid for UAVs using Secondary Radar
A cooperative Sense and Avoid (SAA) algorithm for safe navigation of small-sized UAVs within an airspace is proposed in this paper. The proposed method relies upon cooperation between the UAV and the surrounding transponder-equipped aviation obstacles. To do so, the aviation obstacles share their altitude and identification code with the UAV by using a Mode S operation of the Secondary Surveillance Radar (SSR) after interrogation. The proposed SAA algorithm benefits from the estimate of the aviation obstacle's elevation angle for ranging. This results in more accurate ranging compared to the round-trip time-based ranging, which is currently used in existing SAA systems. We also propose a low-complexity and accurate radial velocity estimator for the Mode S operation of the SSR which is employed in the proposed SAA system. Furthermore, by considering the Pulse-Position Modulation (PPM) of the transponder reply as a waveform of pulse radar with random pulse repetition intervals, the maximum unambiguous radial velocity is obtained. The proposed SAA is equipped with an intruder identification method that determines the risk level ofthe surrounding transponder-equipped aviation obstacles. Given the estimated parameters, the intruder identification method classifies the aviation obstacles into high-, medium-, and low-risk intruders. The output of the classifier enables the UAV to plan its path or maneuver for safe navigation accordingly. The root mean square error (RMSE) of the proposed estimators are analytically derived, and the effectiveness of our SAA solution is confirmed through simulation experiments.
♻ ☆ Joint Ultra-wideband Characterization of Azimuth, Elevation and Time of Arrival with Toric Arrays
In this paper, we present an analytical framework for the joint characterization of the 3D direction of arrival (DoA), i.e., azimuth and elevation components, and time of arrival (ToA) in multipath environments. The analytical framework is based on the use of nearly frequency-invariant beamformers (FIB) formed by toric arrays. The frequency response of the toric array is expanded as a series of phase modes, which leads to azimuth-time and elevation-time diagrams from which the 3D DoA and the ToA of the incoming waves can be extracted over a wide bandwidth. Firstly, we discuss some practical considerations, advantages and limitations of using the analytical method. Subsequently, we perform a parametric study to analyze the influence of the method parameters on the quality of the estimation. The method is tested in single-path and multipath mm-wave environments over a large bandwidth. The results show that the proposed method improves the quality of the estimation, i.e., decreases the level of the artifacts, compared to other state-of-art FIB approaches based on the use of single/concentric circular and elliptical arrays.
comment: Published in IEEE Transactions on Wireless Communications
♻ ☆ Intelligent Surfaces Empowered Wireless Network: Recent Advances and The Road to 6G
Intelligent surfaces (ISs) have emerged as a key technology to empower a wide range of appealing applications for wireless networks, due to their low cost, high energy efficiency, flexibility of deployment and capability of constructing favorable wireless channels/radio environments. Moreover, the recent advent of several new IS architectures further expanded their electromagnetic functionalities from passive reflection to active amplification, simultaneous reflection and refraction, as well as holographic beamforming. However, the research on ISs is still in rapid progress and there have been recent technological advances in ISs and their emerging applications that are worthy of a timely review. Thus, we provide in this paper a comprehensive survey on the recent development and advances of ISs aided wireless networks. Specifically, we start with an overview on the anticipated use cases of ISs in future wireless networks such as 6G, followed by a summary of the recent standardization activities related to ISs. Then, the main design issues of the commonly adopted reflection-based IS and their state-of-the-art solutions are presented in detail, including reflection optimization, deployment, signal modulation, wireless sensing, and integrated sensing and communications. Finally, recent progress and new challenges in advanced IS architectures are discussed to inspire futrue research.
♻ ☆ Movable-Antenna Array Enhanced Beamforming: Achieving Full Array Gain with Null Steering
Conventional beamforming with fixed-position antenna (FPA) arrays has a fundamental trade-off between maximizing the signal power (array gain) over a desired direction and simultaneously minimizing the interference power over undesired directions. To overcome this limitation, this letter investigates the movable antenna (MA) array enhanced beamforming by exploiting the new degree of freedom (DoF) via antenna position optimization, in addition to the design of antenna weights. We show that by jointly optimizing the antenna positions vector (APV) and antenna weights vector (AWV) of a linear MA array, the full array gain can be achieved over the desired direction while null steering can be realized over all undesired directions, under certain numbers of MAs and null-steering directions. The optimal solutions for AWV and APV are derived in closed form, which reveal that the optimal AWV for MA arrays requires only the signal phase adjustment with a fixed amplitude. Numerical results validate our analytical solutions for MA array beamforming and show their superior performance to the conventional beamforming techniques with FPA arrays.
♻ ☆ Movable-Antenna Enhanced Multiuser Communication via Antenna Position Optimization
Movable antenna (MA) is a promising technology to improve wireless communication performance by varying the antenna position in a given finite area at the transceivers to create more favorable channel conditions. In this paper, we investigate the MA-enhanced multiple-access channel (MAC) for the uplink transmission from multiple users each equipped with a single MA to a base station (BS) with a fixed-position antenna (FPA) array. A field-response based channel model is used to characterize the multi-path channel between the antenna array of the BS and each user's MA with a flexible position. To evaluate the MAC performance gain provided by MAs, we formulate an optimization problem for minimizing the total transmit power of users, subject to a minimum-achievable-rate requirement for each user, where the positions of MAs and the transmit powers of users, as well as the receive combining matrix of the BS are jointly optimized. To solve this non-convex optimization problem involving intricately coupled variables, we develop two algorithms based on zero-forcing (ZF) and minimum mean square error (MMSE) combining methods, respectively. Specifically, for each algorithm, the combining matrix of the BS and the total transmit power of users are expressed as a function of the MAs' position vectors, which are then optimized by using the proposed multi-directional descent (MDD) framework. It is shown that the proposed ZF-based and MMSE-based MDD algorithms can converge to high-quality suboptimal solutions with low computational complexities. Simulation results demonstrate that the proposed solutions for MA-enhanced multiple access systems can significantly decrease the total transmit power of users as compared to conventional FPA systems employing antenna selection under both perfect and imperfect field-response information.
♻ ☆ Modeling and Performance Analysis for Movable Antenna Enabled Wireless Communications
In this paper, we propose a novel antenna architecture called movable antenna (MA) to improve the performance of wireless communication systems. Different from conventional fixed-position antennas (FPAs) that undergo random wireless channel variation, the MAs with the capability of flexible movement can be deployed at positions with more favorable channel conditions to achieve higher spatial diversity gains. To characterize the general multi-path channel in a given region or field where the MAs are deployed, a field-response model is developed by leveraging the amplitude, phase, and angle of arrival/angle of departure (AoA/AoD) information on each of the multiple channel paths under the far-field condition. Based on this model, we then analyze the maximum channel gain achieved by a single receive MA as compared to its FPA counterpart in both deterministic and stochastic channels. First, in the deterministic channel case, we show the periodic behavior of the multi-path channel gain in a given spatial field, which can be exploited for analyzing the maximum channel gain of the MA. Next, in the case of stochastic channels, the expected value of an upper bound on the maximum channel gain of the MA in an infinitely large receive region is derived for different numbers of channel paths. The approximate cumulative distribution function (CDF) for the maximum channel gain is also obtained in closed form, which is useful to evaluate the outage probability of the MA system. Moreover, our results reveal that higher performance gains by the MA over the FPA can be acquired when the number of channel paths increases due to more pronounced small-scale fading effects in the spatial domain. Numerical examples are presented which validate our analytical results and demonstrate that the MA system can reap considerable performance gains over the conventional FPA systems with/without antenna selection (AS).
♻ ☆ Movable Antennas for Wireless Communication: Opportunities and Challenges
Movable antenna (MA) technology is a recent development that fully exploits the wireless channel spatial variation in a confined region by enabling local movement of the antenna. Specifically, the positions of antennas at the transmitter and/or receiver can be dynamically changed to obtain better channel conditions for improving the communication performance. In this article, we first provide an overview of the promising applications for MA-aided wireless communication. Then, we present the hardware architecture and channel characterization for MA systems, based on which the variation of the channel gain with respect to the MA's position is illustrated. Furthermore, we analyze the performance advantages of MAs over conventional fixed-position antennas, in terms of signal power improvement, interference mitigation, flexible beamforming, and spatial multiplexing. Finally, we discuss the main design challenges and their potential solutions for MA-aided communication systems.
Sound
☆ Modeling Analog Dynamic Range Compressors using Deep Learning and State-space Models
We describe a novel approach for developing realistic digital models of dynamic range compressors for digital audio production by analyzing their analog prototypes. While realistic digital dynamic compressors are potentially useful for many applications, the design process is challenging because the compressors operate nonlinearly over long time scales. Our approach is based on the structured state space sequence model (S4), as implementing the state-space model (SSM) has proven to be efficient at learning long-range dependencies and is promising for modeling dynamic range compressors. We present in this paper a deep learning model with S4 layers to model the Teletronix LA-2A analog dynamic range compressor. The model is causal, executes efficiently in real time, and achieves roughly the same quality as previous deep-learning models but with fewer parameters.
☆ Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy IJCNN 2024
Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the process. AV-HuBERT can be a useful pre-trained model for lip-reading, which has not been adopted by AV-TSE. In this paper, we would like to explore the way to integrate a pre-trained AV-HuBERT into our AV-TSE system. We have good reasons to expect an improved performance. To benefit from the inter and intra-modality correlations, we also propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. The experimental results on the VoxCeleb2 dataset show that our proposed model outperforms the baselines both in terms of subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. Furthermore, through a comparative study, we confirm that the proposed Mask-And-Recover strategy is significantly effective.
comment: Accepted by IJCNN 2024
☆ Theoretical Analysis of Quality of Conventional Beamforming for Phased Microphone Arrays
A theoretical study is performed to analyze the directional response of different types of microphone array designs. 1-D (linear) and 2-D (planar) microphone array types are considered, and the delay and sum beamforming and conventional beamforming techniques are employed to localize the sound source. A non-dimensional parameter, G, is characterized to simplify and standardize the rejection performance of both 1-D and 2-D microphone arrays as a function of array geometry and sound source parameters. This parameter G is then used to determine an improved design of a 2-D microphone array for far-field sound localization. One such design, termed the Equi-area array is introduced and analyzed in detail. The design is shown to have an advantageous rejection performance compared to other conventionally used 2-D planar microphone arrays.
Robotics
☆ Single-Motor Robotic Gripper with Multi-Surface Fingers for Variable Grasping Configurations
This study proposes a novel robotic gripper with variable grasping configurations for grasping various objects. The fingers of the developed gripper incorporate multiple different surfaces. The gripper possesses the function of altering the finger surfaces facing a target object by rotating the fingers in its longitudinal direction. In the proposed design equipped with two fingers, the two fingers incorporate three and four surfaces, respectively, resulting in the nine available grasping configurations by the combination of these finger surfaces. The developed gripper is equipped with the functions of opening/closing its fingers for grasping and rotating its fingers to alter the grasping configuration -all achieved with a single motor. To enable the two motions using a single motor, this study introduces a self-motion switching mechanism utilizing magnets. This mechanism automatically transitions between gripper motions based on the direction of the motor rotation when the gripper is fully opened. In this state, rotating the motor towards closing initiates the finger closing action, while further opening the fingers from the fully opened state activates the finger rotation. This letter presents the gripper design, the mechanics of the self-motion switching mechanism, the control method, and the grasping configuration selection strategy. The performance of the gripper is experimentally demonstrated.
☆ Guessing human intentions to avoid dangerous situations in caregiving robots IROS
For robots to interact socially, they must interpret human intentions and anticipate their potential outcomes accurately. This is particularly important for social robots designed for human care, which may face potentially dangerous situations for people, such as unseen obstacles in their way, that should be avoided. This paper explores the Artificial Theory of Mind (ATM) approach to inferring and interpreting human intentions. We propose an algorithm that detects risky situations for humans, selecting a robot action that removes the danger in real time. We use the simulation-based approach to ATM and adopt the 'like-me' policy to assign intentions and actions to people. Using this strategy, the robot can detect and act with a high rate of success under time-constrained situations. The algorithm has been implemented as part of an existing robotics cognitive architecture and tested in simulation scenarios. Three experiments have been conducted to test the implementation's robustness, precision and real-time response, including a simulated scenario, a human-in-the-loop hybrid configuration and a real-world scenario.
comment: 8 pages, 6 figures. Submitted to IROS
☆ Combined Task and Motion Planning Via Sketch Decompositions (Extended Version with Supplementary Material)
The challenge in combined task and motion planning (TAMP) is the effective integration of a search over a combinatorial space, usually carried out by a task planner, and a search over a continuous configuration space, carried out by a motion planner. Using motion planners for testing the feasibility of task plans and filling out the details is not effective because it makes the geometrical constraints play a passive role. This work introduces a new interleaved approach for integrating the two dimensions of TAMP that makes use of sketches, a recent simple but powerful language for expressing the decomposition of problems into subproblems. A sketch has width 1 if it decomposes the problem into subproblems that can be solved greedily in linear time. In the paper, a general sketch is introduced for several classes of TAMP problems which has width 1 under suitable assumptions. While sketch decompositions have been developed for classical planning, they offer two important benefits in the context of TAMP. First, when a task plan is found to be unfeasible due to the geometric constraints, the combinatorial search resumes in a specific sub-problem. Second, the sampling of object configurations is not done once, globally, at the start of the search, but locally, at the start of each subproblem. Optimizations of this basic setting are also considered and experimental results over existing and new pick-and-place benchmarks are reported.
☆ M^3RS: Multi-robot, Multi-objective, and Multi-mode Routing and Scheduling
In this paper, we present a novel problem coined multi-robot, multi-objective, and multi-mode routing and scheduling (M^3RS). The formulation for M^3RS is introduced for time-bound multi-robot, multi-objective routing and scheduling missions where each task has multiple execution modes. Different execution modes have distinct resource consumption, associated execution time, and quality. M^3RS assigns the optimal sequence of tasks and the execution modes to each agent. The routes and associated modes depend on user preferences for different objective criteria. The need for M^3RS comes from multi-robot applications in which a trade-off between multiple criteria arises from different task execution modes. We use M^3RS for the application of multi-robot disinfection in public locations. The objectives considered for disinfection application are disinfection quality and number of tasks completed. A mixed-integer linear programming model is proposed for M^3RS. Then, a time-efficient column generation scheme is presented to tackle the issue of computation times for larger problem instances. The advantage of using multiple modes over fixed execution mode is demonstrated using experiments on synthetic data. The results suggest that M^3RS provides flexibility to the user in terms of available solutions and performs well in joint performance metrics. The application of the proposed problem is shown for a team of disinfection robots.} The videos for the experiments are available on the project website: https://sites.google.com/view/g-robot/m3rs/ .
☆ HT-LIP Model based Robust Control of Quadrupedal Robot Locomotion under Unknown Vertical Ground Motion
This paper presents a hierarchical control framework that enables robust quadrupedal locomotion on a dynamic rigid surface (DRS) with general and unknown vertical motions. The key novelty of the framework lies in its higher layer, which is a discrete-time, provably stabilizing footstep controller. The basis of the footstep controller is a new hybrid, time-varying, linear inverted pendulum (HT-LIP) model that is low-dimensional and accurately captures the essential robot dynamics during DRS locomotion. A new set of sufficient stability conditions are then derived to directly guide the controller design for ensuring the asymptotic stability of the HT-LIP model under general, unknown, vertical DRS motions. Further, the footstep controller is cast as a computationally efficient quadratic program that incorporates the proposed HT-LIP model and stability conditions. The middle layer takes the desired footstep locations generated by the higher layer as input to produce kinematically feasible full-body reference trajectories, which are then accurately tracked by a lower-layer torque controller. Hardware experiments on a Unitree Go1 quadrupedal robot confirm the robustness of the proposed framework under various unknown, aperiodic, vertical DRS motions and uncertainties (e.g., slippery and uneven surfaces, solid and liquid loads, and sudden pushes).
☆ Legged Robot State Estimation within Non-inertial Environments
This paper investigates the robot state estimation problem within a non-inertial environment. The proposed state estimation approach relaxes the common assumption of static ground in the system modeling. The process and measurement models explicitly treat the movement of the non-inertial environments without requiring knowledge of its motion in the inertial frame or relying on GPS or sensing environmental landmarks. Further, the proposed state estimator is formulated as an invariant extended Kalman filter (InEKF) with the deterministic part of its process model obeying the group-affine property, leading to log-linear error dynamics. The observability analysis of the filter confirms that the robot's pose (i.e., position and orientation) and velocity relative to the non-inertial environment are observable. Hardware experiments on a humanoid robot moving on a rotating and translating treadmill demonstrate the high convergence rate and accuracy of the proposed InEKF even under significant treadmill pitch sway, as well as large estimation errors.
☆ KITchen: A Real-World Benchmark and Dataset for 6D Object Pose Estimation in Kitchen Environments
Despite the recent progress on 6D object pose estimation methods for robotic grasping, a substantial performance gap persists between the capabilities of these methods on existing datasets and their efficacy in real-world mobile manipulation tasks, particularly when robots rely solely on their monocular egocentric field of view (FOV). Existing real-world datasets primarily focus on table-top grasping scenarios, where a robotic arm is placed in a fixed position and the objects are centralized within the FOV of fixed external camera(s). Assessing performance on such datasets may not accurately reflect the challenges encountered in everyday mobile manipulation tasks within kitchen environments such as retrieving objects from higher shelves, sinks, dishwashers, ovens, refrigerators, or microwaves. To address this gap, we present Kitchen, a novel benchmark designed specifically for estimating the 6D poses of objects located in diverse positions within kitchen settings. For this purpose, we recorded a comprehensive dataset comprising around 205k real-world RGBD images for 111 kitchen objects captured in two distinct kitchens, utilizing one humanoid robot with its egocentric perspectives. Subsequently, we developed a semi-automated annotation pipeline, to streamline the labeling process of such datasets, resulting in the generation of 2D object labels, 2D object segmentation masks, and 6D object poses with minimized human effort. The benchmark, the dataset, and the annotation pipeline are available at https://kitchen-dataset.github.io/KITchen.
☆ Mixed-Initiative Human-Robot Teaming under Suboptimality with Online Bayesian Adaptation
For effective human-agent teaming, robots and other artificial intelligence (AI) agents must infer their human partner's abilities and behavioral response patterns and adapt accordingly. Most prior works make the unrealistic assumption that one or more teammates can act near-optimally. In real-world collaboration, humans and autonomous agents can be suboptimal, especially when each only has partial domain knowledge. In this work, we develop computational modeling and optimization techniques for enhancing the performance of suboptimal human-agent teams, where the human and the agent have asymmetric capabilities and act suboptimally due to incomplete environmental knowledge. We adopt an online Bayesian approach that enables a robot to infer people's willingness to comply with its assistance in a sequential decision-making game. Our user studies show that user preferences and team performance indeed vary with robot intervention styles, and our approach for mixed-initiative collaborations enhances objective team performance ($p<.001$) and subjective measures, such as user's trust ($p<.001$) and perceived likeability of the robot ($p<.001$).
comment: 8 pages, 4 pages for supplementary
☆ Realtime Robust Shape Estimation of Deformable Linear Object ICRA 2024
Realtime shape estimation of continuum objects and manipulators is essential for developing accurate planning and control paradigms. The existing methods that create dense point clouds from camera images, and/or use distinguishable markers on a deformable body have limitations in realtime tracking of large continuum objects/manipulators. The physical occlusion of markers can often compromise accurate shape estimation. We propose a robust method to estimate the shape of linear deformable objects in realtime using scattered and unordered key points. By utilizing a robust probability-based labeling algorithm, our approach identifies the true order of the detected key points and then reconstructs the shape using piecewise spline interpolation. The approach only relies on knowing the number of the key points and the interval between two neighboring points. We demonstrate the robustness of the method when key points are partially occluded. The proposed method is also integrated into a simulation in Unity for tracking the shape of a cable with a length of 1m and a radius of 5mm. The simulation results show that our proposed approach achieves an average length error of 1.07% over the continuum's centerline and an average cross-section error of 2.11mm. The real-world experiments of tracking and estimating a heavy-load cable prove that the proposed approach is robust under occlusion and complex entanglement scenarios.
comment: This paper has been accepted to IEEE ICRA 2024 as a contributed paper
☆ CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field
Recently neural radiance fields (NeRF) have been widely exploited as 3D representations for dense simultaneous localization and mapping (SLAM). Despite their notable successes in surface modeling and novel view synthesis, existing NeRF-based methods are hindered by their computationally intensive and time-consuming volume rendering pipeline. This paper presents an efficient dense RGB-D SLAM system, i.e., CG-SLAM, based on a novel uncertainty-aware 3D Gaussian field with high consistency and geometric stability. Through an in-depth analysis of Gaussian Splatting, we propose several techniques to construct a consistent and stable 3D Gaussian field suitable for tracking and mapping. Additionally, a novel depth uncertainty model is proposed to ensure the selection of valuable Gaussian primitives during optimization, thereby improving tracking efficiency and accuracy. Experiments on various datasets demonstrate that CG-SLAM achieves superior tracking and mapping performance with a notable tracking speed of up to 15 Hz. We will make our source code publicly available. Project page: https://zju3dv.github.io/cg-slam.
comment: Project Page: https://zju3dv.github.io/cg-slam
☆ Are NeRFs ready for autonomous driving? Towards closing the real-to-simulation gap
Neural Radiance Fields (NeRFs) have emerged as promising tools for advancing autonomous driving (AD) research, offering scalable closed-loop simulation and data augmentation capabilities. However, to trust the results achieved in simulation, one needs to ensure that AD systems perceive real and rendered data in the same way. Although the performance of rendering methods is increasing, many scenarios will remain inherently challenging to reconstruct faithfully. To this end, we propose a novel perspective for addressing the real-to-simulated data gap. Rather than solely focusing on improving rendering fidelity, we explore simple yet effective methods to enhance perception model robustness to NeRF artifacts without compromising performance on real data. Moreover, we conduct the first large-scale investigation into the real-to-simulated data gap in an AD setting using a state-of-the-art neural rendering technique. Specifically, we evaluate object detectors and an online mapping model on real and simulated data, and study the effects of different pre-training strategies. Our results show notable improvements in model robustness to simulated data, even improving real-world performance in some cases. Last, we delve into the correlation between the real-to-simulated gap and image reconstruction metrics, identifying FID and LPIPS as strong indicators.
☆ RPMArt: Towards Robust Perception and Manipulation for Articulated Objects IROS 2024
Articulated objects are commonly found in daily life. It is essential that robots can exhibit robust perception and manipulation skills for articulated objects in real-world robotic applications. However, existing methods for articulated objects insufficiently address noise in point clouds and struggle to bridge the gap between simulation and reality, thus limiting the practical deployment in real-world scenarios. To tackle these challenges, we propose a framework towards Robust Perception and Manipulation for Articulated Objects (RPMArt), which learns to estimate the articulation parameters and manipulate the articulation part from the noisy point cloud. Our primary contribution is a Robust Articulation Network (RoArtNet) that is able to predict both joint parameters and affordable points robustly by local feature learning and point tuple voting. Moreover, we introduce an articulation-aware classification scheme to enhance its ability for sim-to-real transfer. Finally, with the estimated affordable point and articulation joint constraint, the robot can generate robust actions to manipulate articulated objects. After learning only from synthetic data, RPMArt is able to transfer zero-shot to real-world articulated objects. Experimental results confirm our approach's effectiveness, with our framework achieving state-of-the-art performance in both noise-added simulation and real-world environments. The code and data will be open-sourced for reproduction. More results are published on the project website at https://r-pmart.github.io .
comment: 8 pages, 7 figures, submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024), project website at https://r-pmart.github.io
☆ MQE: Unleashing the Power of Interaction with Multi-agent Quadruped Environment
The advent of deep reinforcement learning (DRL) has significantly advanced the field of robotics, particularly in the control and coordination of quadruped robots. However, the complexity of real-world tasks often necessitates the deployment of multi-robot systems capable of sophisticated interaction and collaboration. To address this need, we introduce the Multi-agent Quadruped Environment (MQE), a novel platform designed to facilitate the development and evaluation of multi-agent reinforcement learning (MARL) algorithms in realistic and dynamic scenarios. MQE emphasizes complex interactions between robots and objects, hierarchical policy structures, and challenging evaluation scenarios that reflect real-world applications. We present a series of collaborative and competitive tasks within MQE, ranging from simple coordination to complex adversarial interactions, and benchmark state-of-the-art MARL algorithms. Our findings indicate that hierarchical reinforcement learning can simplify task learning, but also highlight the need for advanced algorithms capable of handling the intricate dynamics of multi-agent interactions. MQE serves as a stepping stone towards bridging the gap between simulation and practical deployment, offering a rich environment for future research in multi-agent systems and robot learning. For open-sourced code and more details of MQE, please refer to https://ziyanx02.github.io/multiagent-quadruped-environment/ .
comment: Open-source code is available at https://github.com/ziyanx02/multiagent-quadruped-environment
☆ Robust-Locomotion-by-Logic: Perturbation-Resilient Bipedal Locomotion via Signal Temporal Logic Guided Model Predictive Control
This study introduces a robust planning framework that utilizes a model predictive control (MPC) approach, enhanced by incorporating signal temporal logic (STL) specifications. This marks the first-ever study to apply STL-guided trajectory optimization for bipedal locomotion, specifically designed to handle both translational and orientational perturbations. Existing recovery strategies often struggle with reasoning complex task logic and evaluating locomotion robustness systematically, making them susceptible to failures caused by inappropriate recovery strategies or lack of robustness. To address these issues, we design an analytical robustness metric for bipedal locomotion and quantify this metric using STL specifications, which guide the generation of recovery trajectories to achieve maximum locomotion robustness. To enable safe and computational-efficient crossed-leg maneuver, we design data-driven self-leg-collision constraints that are $1000$ times faster than the traditional inverse-kinematics-based approach. Our framework outperforms a state-of-the-art locomotion controller, a standard MPC without STL, and a linear-temporal-logic-based planner in a high-fidelity dynamic simulation, especially in scenarios involving crossed-leg maneuvers. Additionally, the Cassie bipedal robot achieves robust performance under horizontal and orientational perturbations such as those observed in ship motions. These environments are validated in simulations and deployed on hardware. Furthermore, our proposed method demonstrates versatility on stepping stones and terrain-agnostic features on inclined terrains.
♻ ☆ Make a Donut: Hierarchical EMD-Space Planning for Zero-Shot Deformable Manipulation with Tools
Deformable object manipulation stands as one of the most captivating yet formidable challenges in robotics. While previous techniques have predominantly relied on learning latent dynamics through demonstrations, typically represented as either particles or images, there exists a pertinent limitation: acquiring suitable demonstrations, especially for long-horizon tasks, can be elusive. Moreover, basing learning entirely on demonstrations can hamper the model's ability to generalize beyond the demonstrated tasks. In this work, we introduce a demonstration-free hierarchical planning approach capable of tackling intricate long-horizon tasks without necessitating any training. We employ large language models (LLMs) to articulate a high-level, stage-by-stage plan corresponding to a specified task. For every individual stage, the LLM provides both the tool's name and the Python code to craft intermediate subgoal point clouds. With the tool and subgoal for a particular stage at our disposal, we present a granular closed-loop model predictive control strategy. This leverages Differentiable Physics with Point-to-Point correspondence (DiffPhysics-P2P) loss in the earth mover distance (EMD) space, applied iteratively. Experimental findings affirm that our technique surpasses multiple benchmarks in dough manipulation, spanning both short and long horizons. Remarkably, our model demonstrates robust generalization capabilities to novel and previously unencountered complex tasks without any preliminary demonstrations. We further substantiate our approach with experimental trials on real-world robotic platforms. Our project page: https://qq456cvb.github.io/projects/donut.
comment: 8 pages
♻ ☆ Development and Evaluation of a Learning-based Model for Real-time Haptic Texture Rendering
Current Virtual Reality (VR) environments lack the rich haptic signals that humans experience during real-life interactions, such as the sensation of texture during lateral movement on a surface. Adding realistic haptic textures to VR environments requires a model that generalizes to variations of a user's interaction and to the wide variety of existing textures in the world. Current methodologies for haptic texture rendering exist, but they usually develop one model per texture, resulting in low scalability. We present a deep learning-based action-conditional model for haptic texture rendering and evaluate its perceptual performance in rendering realistic texture vibrations through a multi part human user study. This model is unified over all materials and uses data from a vision-based tactile sensor (GelSight) to render the appropriate surface conditioned on the user's action in real time. For rendering texture, we use a high-bandwidth vibrotactile transducer attached to a 3D Systems Touch device. The result of our user study shows that our learning-based method creates high-frequency texture renderings with comparable or better quality than state-of-the-art methods without the need for learning a separate model per texture. Furthermore, we show that the method is capable of rendering previously unseen textures using a single GelSight image of their surface.
comment: Accepted for publication in IEEE Transactions on Haptics 2024. 12 pages, 8 figures
♻ ☆ Generative Graphical Inverse Kinematics
Quickly and reliably finding accurate inverse kinematics (IK) solutions remains a challenging problem for many robot manipulators. Existing numerical solvers are broadly applicable but typically only produce a single solution and rely on local search techniques to minimize nonconvex objective functions. More recent learning-based approaches that approximate the entire feasible set of solutions have shown promise as a means to generate multiple fast and accurate IK results in parallel. However, existing learning-based techniques have a significant drawback: each robot of interest requires a specialized model that must be trained from scratch. To address this key shortcoming, we propose a novel distance-geometric robot representation coupled with a graph structure that allows us to leverage the sample efficiency of Euclidean equivariant functions and the generalizability of graph neural networks (GNNs). Our approach is generative graphical inverse kinematics (GGIK), the first learned IK solver able to accurately and efficiently produce a large number of diverse solutions in parallel while also displaying the ability to generalize -- a single learned model can be used to produce IK solutions for a variety of different robots. When compared to several other learned IK methods, GGIK provides more accurate solutions with the same amount of data. GGIK can generalize reasonably well to robot manipulators unseen during training. Additionally, GGIK can learn a constrained distribution that encodes joint limits and scales efficiently to larger robots and a high number of sampled solutions. Finally, GGIK can be used to complement local IK solvers by providing reliable initializations for a local optimization process.
comment: Submitted to IEEE Transactions on Robotics, June 2023
♻ ☆ SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs ICRA 2024
Object rearrangement is pivotal in robotic-environment interactions, representing a significant capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that utilizes a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot exemplifies lightweight, real-time, and user-controllable characteristics, seamlessly blending the consideration of commonsense knowledge with automatic generation capabilities. SG-Bot employs a three-fold procedure--observation, imagination, and execution--to adeptly address the task. Initially, objects are discerned and extracted from a cluttered scene during the observation. These objects are first coarsely organized and depicted within a scene graph, guided by either commonsense or user-defined criteria. Then, this scene graph subsequently informs a generative model, which forms a fine-grained goal scene considering the shape information from the initial scene and object semantics. Finally, for execution, the initial and envisioned goal scenes are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.
comment: ICRA 2024 accepted. Project website: https://sites.google.com/view/sg-bot
♻ ☆ EnduRL: Enhancing Safety, Stability, and Efficiency of Mixed Traffic Under Real-World Perturbations Via Reinforcement Learning
Human-driven vehicles (HVs) amplify naturally occurring perturbations in traffic, leading to congestion--a major contributor to increased fuel consumption, higher collision risks, and reduced road capacity utilization. While previous research demonstrates that Robot Vehicles (RVs) can be leveraged to mitigate these issues, most such studies rely on simulations with simplistic models of human car-following behaviors. In this work, we analyze real-world driving trajectories and extract a wide range of acceleration profiles. We then incorporates these profiles into simulations for training RVs to mitigate congestion. We evaluate the safety, efficiency, and stability of mixed traffic via comprehensive experiments conducted in two mixed traffic environments (Ring and Bottleneck) at various traffic densities, configurations, and RV penetration rates. The results show that under real-world perturbations, prior RV controllers experience performance degradation on all three objectives (sometimes even lower than 100% HVs). To address this, we introduce a reinforcement learning based RV that employs a congestion stage classifier to optimize the safety, efficiency, and stability of mixed traffic. Our RVs demonstrate significant improvements: safety by up to 66%, efficiency by up to 54%, and stability by up to 97%.
♻ ☆ Soft finger rotational stability for precision grasps IROS24
Soft robotic fingers can safely grasp fragile or variable form objects, but their force capacity is limited, especially with less contact area: precision grasps and when objects are smaller or not spherical. Current research is improving force capacity through mechanical design by increasing contact area or stiffness, typically without models which explain soft finger force limitations. To address this, this paper considers two types of soft grip failure, slip and dynamic rotational stability. For slip, the validity of a Coulomb model investigated, identifying the effect of contact area, pressure, and relative pose. For rotational stability, bulk linear stiffness of the fingers is used to develop conditions for dynamic stability and identify when rotation leads to slip. Together, these models suggest contact area improves force capacity by increasing transverse stiffness and normal force. The models are validated on pneumatic fingers, both custom PneuNets-based and commercially available. The models are used to find grip parameters which increase force capacity without failure.
comment: Submitted IROS24
♻ ☆ Combining Sampling- and Gradient-based Planning for Contact-rich Manipulation ICRA24
Planning over discontinuous dynamics is needed for robotics tasks like contact-rich manipulation, which presents challenges in the numerical stability and speed of planning methods when either neural network or analytical models are used. On the one hand, sampling-based planners require higher sample complexity in high-dimensional problems and cannot describe safety constraints such as force limits. On the other hand, gradient-based solvers can suffer from local optima and convergence issues when the Hessian is poorly conditioned. We propose a planning method with both sampling- and gradient-based elements, using the Cross-entropy Method to initialize a gradient-based solver, providing better search over local minima and the ability to handle explicit constraints. We show the approach allows smooth, stable contact-rich planning for an impedance-controlled robot making contact with a stiff environment, benchmarking against gradient-only MPC and CEM.
comment: Submitted ICRA24. Video available at https://youtu.be/COqR90392Kw Code available at https://gitlab.cc-asp.fraunhofer.de/hanikevi/contact_mpc
♻ ☆ SwarmPRM: Probabilistic Roadmap Motion Planning for Large-Scale Swarm Robotic Systems IROS 2024
Large-scale swarm robotic systems consisting of numerous cooperative agents show considerable promise for performing autonomous tasks across various sectors. Nonetheless, traditional motion planning approaches often face a trade-off between scalability and solution quality due to the exponential growth of the joint state space of robots. In response, this work proposes SwarmPRM, a hierarchical, scalable, computationally efficient, and risk-aware sampling-based motion planning approach for large-scale swarm robots. SwarmPRM utilizes a Gaussian Mixture Model (GMM) to represent the swarm's macroscopic state and constructs a Probabilistic Roadmap in Gaussian space, referred to as the Gaussian roadmap, to generate a transport trajectory of GMM. This trajectory is then followed by each robot at the microscopic stage. To enhance trajectory safety, SwarmPRM incorporates the conditional value-at-risk (CVaR) in the collision checking process to impart the property of risk awareness to the constructed Gaussian roadmap. SwarmPRM then crafts a linear programming formulation to compute the optimal GMM transport trajectory within this roadmap. Extensive simulations demonstrate that SwarmPRM outperforms state-of-the-art methods in computational efficiency, scalability, and trajectory quality while offering the capability to adjust the risk tolerance of generated trajectories.
comment: Submitted to IROS 2024
♻ ☆ DRL-Based Trajectory Tracking for Motion-Related Modules in Autonomous Driving
Autonomous driving systems are always built on motion-related modules such as the planner and the controller. An accurate and robust trajectory tracking method is indispensable for these motion-related modules as a primitive routine. Current methods often make strong assumptions about the model such as the context and the dynamics, which are not robust enough to deal with the changing scenarios in a real-world system. In this paper, we propose a Deep Reinforcement Learning (DRL)-based trajectory tracking method for the motion-related modules in autonomous driving systems. The representation learning ability of DL and the exploration nature of RL bring strong robustness and improve accuracy. Meanwhile, it enhances versatility by running the trajectory tracking in a model-free and data-driven manner. Through extensive experiments, we demonstrate both the efficiency and effectiveness of our method compared to current methods. Code and documentation are released to facilitate both further research and industrial deployment.
comment: Technical report. Code: https://github.com/MARMOTatZJU/drl-based-trajectory-tracking Documentation: https://drl-based-trajectory-tracking.readthedocs.io
♻ ☆ A Number Sense as an Emergent Property of the Manipulating Brain
The ability to understand and manipulate numbers and quantities emerges during childhood, but the mechanism through which humans acquire and develop this ability is still poorly understood. We explore this question through a model, assuming that the learner is able to pick up and place small objects from, and to, locations of its choosing, and will spontaneously engage in such undirected manipulation. We further assume that the learner's visual system will monitor the changing arrangements of objects in the scene and will learn to predict the effects of each action by comparing perception with a supervisory signal from the motor system. We model perception using standard deep networks for feature extraction and classification, and gradient descent learning. Our main finding is that, from learning the task of action prediction, an unexpected image representation emerges exhibiting regularities that foreshadow the perception and representation of numbers and quantity. These include distinct categories for zero and the first few natural numbers, a strict ordering of the numbers, and a one-dimensional signal that correlates with numerical quantity. As a result, our model acquires the ability to estimate numerosity, i.e. the number of objects in the scene, as well as subitization, i.e. the ability to recognize at a glance the exact number of objects in small scenes. Remarkably, subitization and numerosity estimation extrapolate to scenes containing many objects, far beyond the three objects used during training. We conclude that important aspects of a facility with numbers and quantities may be learned with supervision from a simple pre-training task. Our observations suggest that cross-modal learning is a powerful learning mechanism that may be harnessed in artificial intelligence.
comment: 16 pages, 5 figures, 15 supplemental figures
♻ ☆ Effective Integration of Weighted Cost-to-go and Conflict Heuristic within Suboptimal CBS AAAI 2023
Conflict-Based Search (CBS) is a popular multi-agent path finding (MAPF) solver that employs a low-level single agent planner and a high-level constraint tree to resolve conflicts. The vast majority of modern MAPF solvers focus on improving CBS by reducing the size of this tree through various strategies with few methods modifying the low level planner. Typically low level planners in existing CBS methods use an unweighted cost-to-go heuristic, with suboptimal CBS methods also using a conflict heuristic to help the high level search. In this paper, we show that, contrary to prevailing CBS beliefs, a weighted cost-to-go heuristic can be used effectively alongside the conflict heuristic in two possible variants. In particular, one of these variants can obtain large speedups, 2-100x, across several scenarios and suboptimal CBS methods. Importantly, we discover that performance is related not to the weighted cost-to-go heuristic but rather to the relative conflict heuristic weight's ability to effectively balance low-level and high-level work. Additionally, to the best of our knowledge, we show the first theoretical relation of prioritized planning and bounded suboptimal CBS and demonstrate that our methods are their natural generalization. Update March 2024: We found that the relative speedup decreases to around 1.2-10x depending on how the conflict heuristic is computed (see appendix for more details).
comment: Published in AAAI 2023
Databases
☆ On Reporting Durable Patterns in Temporal Proximity Graphs
Finding patterns in graphs is a fundamental problem in databases and data mining. In many applications, graphs are temporal and evolve over time, so we are interested in finding durable patterns, such as triangles and paths, which persist over a long time. While there has been work on finding durable simple patterns, existing algorithms do not have provable guarantees and run in strictly super-linear time. The paper leverages the observation that many graphs arising in practice are naturally proximity graphs or can be approximated as such, where nodes are embedded as points in some high-dimensional space, and two nodes are connected by an edge if they are close to each other. We work with an implicit representation of the proximity graph, where nodes are additionally annotated by time intervals, and design near-linear-time algorithms for finding (approximately) durable patterns above a given durability threshold. We also consider an interactive setting where a client experiments with different durability thresholds in a sequence of queries; we show how to compute incremental changes to result patterns efficiently in time near-linear to the size of the changes.
☆ SQL-Encoder: Improving NL2SQL In-Context Learning Through a Context-Aware Encoder
Detecting structural similarity between queries is essential for selecting examples in in-context learning models. However, assessing structural similarity based solely on the natural language expressions of queries, without considering SQL queries, presents a significant challenge. This paper explores the significance of this similarity metric and proposes a model for accurately estimating it. To achieve this, we leverage a dataset comprising 170k question pairs, meticulously curated to train a similarity prediction model. Our comprehensive evaluation demonstrates that the proposed model adeptly captures the structural similarity between questions, as evidenced by improvements in Kendall-Tau distance and precision@k metrics. Notably, our model outperforms strong competitive embedding models from OpenAI and Cohere. Furthermore, compared to these competitive models, our proposed encoder enhances the downstream performance of NL2SQL models in 1-shot in-context learning scenarios by 1-2\% for GPT-3.5-turbo, 4-8\% for CodeLlama-7B, and 2-3\% for CodeLlama-13B.
☆ ByteCard: Enhancing Data Warehousing with Learned Cardinality Estimation
Cardinality estimation is a critical component and a longstanding challenge in modern data warehouses. ByteHouse, ByteDance's cloud-native engine for big data analysis in exabyte-scale environments, serves numerous internal decision-making business scenarios. With the increasing demand of ByteHouse, cardinality estimation becomes the bottleneck for efficiently processing queries. Specifically, the existing query optimizer of ByteHouse uses the traditional Selinger-like cardinality estimator, which can produce huge estimation errors, resulting in sub-optimal query plans. To improve cardinality estimation accuracy while maintaining a practical inference overhead, we develop ByteCard framework that enables efficient training/updating and integration of cardinality estimators. Furthermore, ByteCard adapts recent advances in cardinality estimation to build models that can balance accuracy and practicality (e.g., inference latency, model size, training/updating overhead). We observe significant query processing speed-up in ByteHouse after replacing the system's existing cardinality estimation with ByteCard's estimations for several optimization strategies. Evaluations on real-world datasets show the integration of ByteCard leads to an improvement of up to 30% in the 99th quantile of latency. At last, we share our valuable experience in engineering advanced cardinality estimators. We believe this experience can help other data warehouses integrate more accurate and sophisticated solutions on the critical path of query execution.
♻ ☆ Insert-Only versus Insert-Delete in Dynamic Query Evaluation
We study the dynamic query evaluation problem: Given a join query Q and a stream of updates, we would like to construct a data structure that supports constant-delay enumeration of the query output after each update. We show that a stream of N insert-only updates (to an initially empty database) can be executed in total time O(N^{w(Q)}), where w(Q) is the fractional hypertree width of Q. This matches the complexity of the static query evaluation problem for Q and a database of size N. One corollary is that the average time per single-tuple insert is constant for acyclic joins. In contrast, we show that a stream of N insert-and-delete updates to Q can be executed in total time O(N^{w(Q')}), where Q' is obtained from Q by extending every relational atom with extra variables that represent the "lifespans" of tuples in Q. We show that this reduction is optimal in the sense that the static evaluation runtime of Q' provides a lower bound on the total update time of Q. Our approach recovers the optimal single-tuple update time for known queries such as the hierarchical and Loomis-Whitney join queries.
♻ ☆ Enhancing Grover's Search Algorithm: A Modified Approach to Increase the Probability of Good States
This article introduces an enhancement to the Grover search algorithm to speed up computing the probability of finding good states. It suggests incorporating a rotation phase angle determined mathematically from the derivative of the model during the initial iteration. At each iteration, a new phase angle is computed and used in a rotation gate around y+z axis in the diffusion operator. The computed phase angles are optimized through an adaptive adjustment based on the estimated increasing ratio of the consecutive amplitudes. The findings indicate an average decrease of 28% in the required number of iterations resulting in a faster overall process and fewer number of quantum gates. For large search space, this improvement rises to 29.58%. Given the computational capabilities of the computer utilized for the simulation, the approach is applied to instances with up to 12 qubits or 4096 possible combination of search entries.
♻ ☆ Finding Smallest Witnesses for Conjunctive Queries
A witness is a sub-database that preserves the query results of the original database but of much smaller size. It has wide applications in query rewriting and debugging, query explanation, IoT analytics, multi-layer network routing, etc. In this paper, we study the smallest witness problem (SWP) for the class of conjunctive queries (CQs) without self-joins. We first establish the dichotomy that SWP for a CQ can be computed in polynomial time if and only if it has {\em head-cluster property}, unless $\texttt{P} = \texttt{NP}$. We next turn to the approximated version by relaxing the size of a witness from being minimum. We surprisingly find that the {\em head-domination} property - that has been identified for the deletion propagation problem \cite{kimelfeld2012maximizing} - can also precisely capture the hardness of the approximated smallest witness problem. In polynomial time, SWP for any CQ with head-domination property can be approximated within a constant factor, while SWP for any CQ without such a property cannot be approximated within a logarithmic factor, unless $\texttt{P} = \texttt{NP}$. We further explore efficient approximation algorithms for CQs without head-domination property: (1) we show a trivial algorithm which achieves a polynomially large approximation ratio for general CQs; (2) for any CQ with only one non-output attribute, such as star CQs, we show a greedy algorithm with a logarithmic approximation ratio; (3) for line CQs, which contain at least two non-output attributes, we relate SWP problem to the directed steiner forest problem, whose algorithms can be applied to line CQs directly. Meanwhile, we establish a much higher lower bound, exponentially larger than the logarithmic lower bound obtained above. It remains open to close the gap between the lower and upper bound of the approximated SWP for CQs without head-domination property.
Systems and Control
Datasets of Great Britain Primary Substations Integrated with Household Heating Information
The growing demand for electrified heating, electrified transportation, and power-intensive data centres challenge distribution networks. If electrification projects are carried out without considering electrical distribution infrastructure, there could be unexpected blackouts and financial losses. Datasets containing real-world distribution network information are required to address this. On the other hand, social data, such as household heating composition, are closely coupled with people's lives. Studying the coupling between the energy system and society is important in promoting social welfare. To fill these gaps, this paper introduces two datasets. The first is the main dataset for the distribution networks in Great Britain (GB), collecting information on firm capacity, peak demands, locations, and parent transmission nodes (the Grid Supply Point, namely GSP) for all primary substations (PSs). PSs are a crucial part of the UK distribution network and are at the lowest voltage level (11 kV) with publicly available data for most UK Distribution Network Operators (DNOs). Substation firm capacity and peak demand facilitate an understanding of the remaining room of the existing network. The parent GSP information helps link the dataset of distribution networks to datasets of transmission networks. The second dataset extends the main network dataset, linking each PS to information about the number of households that use different types of central heating recorded in census data. The derivation of the second dataset is based on locations of PSs collected in the main dataset with appropriate assumptions. The derivation process may also be replicated to integrate other social datasets.
comment: Submitted to the journal "Data in Brief"
☆ ANN-Based Adaptive NMPC for Uranium Extraction-Scrubbing Operation in Spent Nuclear Fuel Treatment Process
This paper addresses the particularities in optimal control of the uranium extraction-scrubbing operation in the PUREX process. The control problem requires optimally stabilizing the system at a desired solvent saturation level, guaranteeing constraints, disturbance rejection, and adapting to set point variations. A qualified simulator named PAREX was developed by the French Alternative Energies and Atomic Energy Commission (CEA) to simulate liquid-liquid extraction operations in the PUREX process. However, since the mathematical model is complex and is described by a system of nonlinear, stiff, high-dimensional differential-algebraic equations (DAE), applying optimal control methods will lead to a large-scale nonlinear programming problem with a huge computational burden. The solution we propose in this work is to train a neural network to predict the process outputs using the measurement history. This neural network architecture, which employs the long short-term memory (LSTM), linear regression and logistic regression networks, allows reducing the number of state variables, thus reducing the complexity of the optimization problems in the control scheme. Furthermore, nonlinear model predictive control (NMPC) and moving horizon estimation (MHE) problems are developed and solved using the PSO (Particle Swarm Optimization) algorithm. Simulation results show that the proposed adaptive optimal control scheme satisfies the requirements of the control problem and provides promise for experimental testing.
☆ Control-Coherent Koopman Modeling: A Physical Modeling Approach
The modeling of nonlinear dynamics based on Koopman operator theory, which is originally applicable only to autonomous systems with no control, is extended to non-autonomous control system without approximation to input matrix B. Prevailing methods using a least square estimate of the B matrix may result in an erroneous input matrix, misinforming the controller about the structure of the input matrix in a lifted space. Here, a new method for constructing a Koopman model that comprises the exact input matrix B is presented. A set of state variables are introduced so that the control inputs are linearly involved in the dynamics of actuators. With these variables, a lifted linear model with the exact control matrix, called a Control-Coherent Koopman Model, is constructed by superposing control input terms, which are linear in local actuator dynamics, to the Koopman operator of the associated autonomous nonlinear system. The proposed method is applied to multi degree-of-freedom robotic arms and multi-cable manipulation systems. Model Predictive Control is applied to the former. It is demonstrated that the prevailing Dynamic Mode Decomposition with Control (DMDc) using an approximate control matrix B does not provide a satisfactory result, while the Control-Coherent Koopman Model performs well with the correct B matrix.
☆ Semi-Automatic Line-System Provisioning with Integrated Physical-Parameter-Aware Methodology: Field Verification and Operational Feasibility
We propose methods and an architecture to conduct measurements and optimize newly installed optical fiber line systems semi-automatically using integrated physics-aware technologies in a data center interconnection (DCI) transmission scenario. We demonstrate, for the first time, digital longitudinal monitoring (DLM) and optical line system (OLS) physical parameter calibration working together in real-time to extract physical link parameters for transmission performance optimization. Our methodology has the following advantages over traditional design: a minimized footprint at user sites, accurate estimation of the necessary optical network characteristics via complementary telemetry technologies, and the capability to conduct all operation work remotely. The last feature is crucial, as it enables remote operation to implement network design settings for immediate response to quality of transmission (QoT) degradation and reversion in the case of unforeseen problems. We successfully performed semi-automatic line system provisioning over field fiber networks facilities at Duke University, Durham, NC. The tasks of parameter retrieval, equipment setting optimization, and system setup/provisioning were completed within 1 hour. The field operation was supervised by on-duty personnel who could access the system remotely from different time zones. By comparing Q-factor estimates calculated from the extracted link parameters with measured results from 400G transceivers, we confirmed that our methodology has a reduction in the QoT prediction errors (+-0.3 dB) over existing design (+-10.6 dB).
☆ HT-LIP Model based Robust Control of Quadrupedal Robot Locomotion under Unknown Vertical Ground Motion
This paper presents a hierarchical control framework that enables robust quadrupedal locomotion on a dynamic rigid surface (DRS) with general and unknown vertical motions. The key novelty of the framework lies in its higher layer, which is a discrete-time, provably stabilizing footstep controller. The basis of the footstep controller is a new hybrid, time-varying, linear inverted pendulum (HT-LIP) model that is low-dimensional and accurately captures the essential robot dynamics during DRS locomotion. A new set of sufficient stability conditions are then derived to directly guide the controller design for ensuring the asymptotic stability of the HT-LIP model under general, unknown, vertical DRS motions. Further, the footstep controller is cast as a computationally efficient quadratic program that incorporates the proposed HT-LIP model and stability conditions. The middle layer takes the desired footstep locations generated by the higher layer as input to produce kinematically feasible full-body reference trajectories, which are then accurately tracked by a lower-layer torque controller. Hardware experiments on a Unitree Go1 quadrupedal robot confirm the robustness of the proposed framework under various unknown, aperiodic, vertical DRS motions and uncertainties (e.g., slippery and uneven surfaces, solid and liquid loads, and sudden pushes).
☆ Legged Robot State Estimation within Non-inertial Environments
This paper investigates the robot state estimation problem within a non-inertial environment. The proposed state estimation approach relaxes the common assumption of static ground in the system modeling. The process and measurement models explicitly treat the movement of the non-inertial environments without requiring knowledge of its motion in the inertial frame or relying on GPS or sensing environmental landmarks. Further, the proposed state estimator is formulated as an invariant extended Kalman filter (InEKF) with the deterministic part of its process model obeying the group-affine property, leading to log-linear error dynamics. The observability analysis of the filter confirms that the robot's pose (i.e., position and orientation) and velocity relative to the non-inertial environment are observable. Hardware experiments on a humanoid robot moving on a rotating and translating treadmill demonstrate the high convergence rate and accuracy of the proposed InEKF even under significant treadmill pitch sway, as well as large estimation errors.
☆ Bi-Level Control of Weaving Sections in Mixed Traffic Environments with Connected and Automated Vehicles
Connected and automated vehicles (CAVs) can be beneficial for improving the operation of highway bottlenecks such as weaving sections. This paper proposes a bi-level control approach based on an upper-level deep reinforcement learning controller and a lower-level model predictive controller to coordinate the lane-changings of a mixed fleet of CAVs and human-driven vehicles (HVs) in weaving sections. The upper level represents a roadside controller that collects vehicular information from the entire weaving section and determines the control weights used in the lower-level controller. The lower level is implemented within each CAV, which takes the control weights from the upper-level controller and generates the acceleration and steering angle for individual CAVs based on the local situation. The lower-level controller further incorporates an HV trajectory predictor, which is capable of handling the dynamic topology of vehicles in weaving scenarios with intensive mandatory lane changes. The case study inspired by a real weaving section in Basel, Switzerland, shows that our method consistently outperforms state-of-the-art benchmarks.
comment: 12 pages, 8 figures
☆ Efficient Reachable Sets on Lie Groups Using Lie Algebra Monotonicity and Tangent Intervals
In this paper, we efficiently compute overapproximated reachable sets for control systems evolving on Lie groups, building off results from monotone systems theory and geometric integration theory. We propose to consider intervals living in the Lie algebra, which through the exponential map, describe real sets on the Lie group. A local equivalence between the original system and a system evolving on the Lie algebra allows existing interval reachability techniques to apply in the tangent space. Using interval bounds of the Baker-Campbell-Hausdorff formula, a Runge-Kutta-Munthe-Kaas reachability algorithm is proposed, providing reachable set estimates for arbitrary time horizons at little computational cost. The algorithm is demonstrated on through consensus on a torus and attitude control on $SO(3)$.
☆ Voltage Regulation in Polymer Electrolyte Fuel Cell Systems Using Gaussian Process Model Predictive Control
This study introduces a novel approach utilizing Gaussian process model predictive control (MPC) to stabilize the output voltage of a polymer electrolyte fuel cell (PEFC) system by simultaneously regulating hydrogen and airflow rates. Two Gaussian process models are developed to capture PEFC dynamics, taking into account constraints including hydrogen pressure and input change rates, thereby aiding in mitigating errors inherent to PEFC predictive control. The dynamic performance of the physical model and Gaussian process MPC in constraint handling and system inputs is compared and analyzed. Simulation outcomes demonstrate that the proposed Gaussian process MPC effectively maintains the voltage at the target 48 V while adhering to safety constraints, even amidst workload disturbances ranging from 110-120 A. In comparison to traditional MPC using detailed system models, Gaussian process MPC exhibits a 43\% higher overshoot and 25\% slower response time. Nonetheless, it offers the advantage of not requiring the underlying true system model and needing less system information.
☆ Input-to-State Stability of Newton Methods for Generalized Equations in Nonlinear Optimization
We show that Newton methods for generalized equations are input-to-state stable with respect to disturbances such as due to inexact computations. We then use this result to obtain convergence and robustness of a multistep Newton-type method for multivariate generalized equations. We demonstrate the usefulness of the results with other applications to nonlinear optimization. In particular, we provide a new proof for (robust) local convergence of the augmented Lagrangian method.
comment: Submitted to 2024 Conference on Decision and Control
☆ Data-Driven Sliding Mode Control for Partially Unknown Nonlinear Systems
This paper introduces a new design method for data-driven control of nonlinear systems with partially unknown dynamics and unknown bounded disturbance. Since it is not possible to achieve exact nonlinearity cancellation in the presence of unknown disturbance, this paper adapts the idea of sliding mode control (SMC) to ensure system stability and robustness without assuming that the nonlinearity goes to zero faster than the state as in the existing methods. The SMC consists of a data-dependent robust controller ensuring the system state trajectory reach and remain on the sliding surface and a nominal controller solved from a data-dependent semidefinite program (SDP) ensuring robust stability of the state trajectory on the sliding surface. Numerical simulation results demonstrate effectiveness of the proposed data-driven SMC and its superior in terms of robust stability over the existing data-driven control that also uses approximate nonlinearity cancellation.
comment: Submitted to IEEE CDC 2024
☆ Runtime Monitoring and Fault Detection for Neural Network-Controlled Systems
There is an emerging trend in applying deep learning methods to control complex nonlinear systems. This paper considers enhancing the runtime safety of nonlinear systems controlled by neural networks in the presence of disturbance and measurement noise. A robustly stable interval observer is designed to generate sound and precise lower and upper bounds for the neural network, nonlinear function, and system state. The obtained interval is utilised to monitor the real-time system safety and detect faults in the system outputs or actuators. An adaptive cruise control vehicular system is simulated to demonstrate effectiveness of the proposed design.
comment: Accepted to SAFEPROCESS 2024
☆ Towards a MATLAB Toolbox to compute backstepping kernels using the power series method
In this paper, we extend our previous work on the power series method for computing backstepping kernels. Our first contribution is the development of initial steps towards a MATLAB toolbox dedicated to backstepping kernel computation. This toolbox would exploit MATLAB's linear algebra and sparse matrix manipulation features for enhanced efficiency; our initial findings show considerable improvements in computational speed with respect to the use of symbolical software without loss of precision at high orders. Additionally, we tackle limitations observed in our earlier work, such as slow convergence (due to oscillatory behaviors) and non-converging series (due to loss of analiticity at some singular points). To overcome these challenges, we introduce a technique that mitigates this behaviour by computing the expansion at different points, denoted as localized power series. This approach effectively navigates around singularities, and can also accelerates convergence by using more local approximations. Basic examples are provided to demonstrate these enhancements. Although this research is still ongoing, the significant potential and simplicity of the method already establish the power series approach as a viable and versatile solution for solving backstepping kernel equations, benefiting both novel and experienced practitioners in the field. We anticipate that these developments will be particularly beneficial in training the recently introduced neural operators that approximate backstepping kernels and gains.
comment: Preprint submitted to CDC 2024
☆ Digital control of negative imaginary systems: a discrete-time hybrid integrator-gain system approach
A hybrid integrator-gain system (HIGS) is a control element that switches between an integrator and a gain, which overcomes some inherent limitations of linear controllers. In this paper, we consider using discrete-time HIGS controllers for the digital control of negative imaginary (NI) systems. We show that the discrete-time HIGS themselves are step-advanced negative imaginary systems. For a minimal linear NI system, there always exists a HIGS controller that can asymptotically stablize it. An illustrative example is provided, where we use the proposed HIGS control method to stabilize a discrete-time mass-spring system.
comment: To appear in the 2024 European Control Conference. 7 pages, 3 figures
☆ Force Controlled Printing for Material Extrusion Additive Manufacturing
In material extrusion additive manufacturing, the extrusion process is commonly controlled in a feed-forward fashion. The amount of material to be extruded at each printing location is pre-computed by a planning software. This approach is inherently unable to adapt the extrusion to external and unexpected disturbances, and the quality of the results strongly depends on a number of modeling and tuning parameters. To overcome these limitations, we propose the first framework for Force Controlled Printing for material extrusion additive manufacturing. We utilize a custom-built extruder to measure the extrusion force in real time, and use this quantity as feedback to continuously control the material flow in closed-loop. We demonstrate the existence of a strong correlation between extrusion force and line width, which we exploit to deposit lines of desired width in a width range of 33 % up to 233 % of the nozzle diameter. We also show how Force Controlled Printing outperforms conventional feed-forward extrusion in print quality and disturbance rejection, while requiring little tuning and automatically adapting to changes in the hardware settings. With no adaptation, Force Controlled Printing can deposit lines of desired width under severe disturbances in bed leveling, such as at layer heights ranging between 20 % and 200 % of the nominal height.
comment: Preprint to be submitted to Elsevier Additive Manufacturing
☆ Fisher Information Approach for Masking the Sensing Plan: Applications in Multifunction Radars
How to design a Markov Decision Process (MDP) based radar controller that makes small sacrifices in performance to mask its sensing plan from an adversary? The radar controller purposefully minimizes the Fisher information of its emissions so that an adversary cannot identify the controller's model parameters accurately. Unlike classical open loop statistical inference, where the Fisher information serves as a lower bound for the achievable covariance, this paper employs the Fisher information as a design constraint for a closed loop radar controller to mask its sensing plan. We analytically derive a closed-form expression for the determinant of the Fisher Information Matrix (FIM) pertaining to the parameters of the MDP-based controller. Subsequently, we constrain the MDP with respect to the determinant of the FIM. Numerical results show that the introduction of minor perturbations to the MDP's transition kernel and the total operation cost can reduce the Fisher Information of the emissions. Consequently, this reduction amplifies the variability in policy and transition kernel estimation errors, thwarting the adversary's accuracy in estimating the controller's sensing plan.
♻ ☆ Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning
Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet, deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. However, this reliance on predefined safety constraints poses limitations in dynamic and unpredictable real-world settings where such constraints may not be available or sufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task, intricately integrating constrained policy optimization, using a Lagrangian-variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization for optimizing parameters for the given pSTL safety specification. Through experimentation in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate successful learning of STL safety constraint parameters, exhibiting a high degree of conformity with true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario that possesses complete prior knowledge of safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to those constraints.
♻ ☆ Efficient Recursive Data-enabled Predictive Control (Extended Version)
In the field of model predictive control, Data-enabled Predictive Control (DeePC) offers direct predictive control, bypassing traditional modeling. However, challenges emerge with increased computational demand due to recursive data updates. This paper introduces a novel recursive updating algorithm for DeePC. It emphasizes the use of Singular Value Decomposition (SVD) for efficient low-dimensional transformations of DeePC in its general form, as well as a fast SVD update scheme. Importantly, our proposed algorithm is highly flexible due to its reliance on the general form of DeePC, which is demonstrated to encompass various data-driven methods that utilize Pseudoinverse and Hankel matrices. This is exemplified through a comparison to Subspace Predictive Control, where the algorithm achieves asymptotically consistent prediction for stochastic linear time-invariant systems. Our proposed methodologies' efficacy is validated through simulation studies.
♻ ☆ Hierarchical Federated Learning in Wireless Networks: Pruning Tackles Bandwidth Scarcity and System Heterogeneity
While a practical wireless network has many tiers where end users do not directly communicate with the central server, the users' devices have limited computation and battery powers, and the serving base station (BS) has a fixed bandwidth. Owing to these practical constraints and system models, this paper leverages model pruning and proposes a pruning-enabled hierarchical federated learning (PHFL) in heterogeneous networks (HetNets). We first derive an upper bound of the convergence rate that clearly demonstrates the impact of the model pruning and wireless communications between the clients and the associated BS. Then we jointly optimize the model pruning ratio, central processing unit (CPU) frequency and transmission power of the clients in order to minimize the controllable terms of the convergence bound under strict delay and energy constraints. However, since the original problem is not convex, we perform successive convex approximation (SCA) and jointly optimize the parameters for the relaxed convex problem. Through extensive simulation, we validate the effectiveness of our proposed PHFL algorithm in terms of test accuracy, wall clock time, energy consumption and bandwidth requirement.
comment: Accepted for publications in the IEEE Transactions on Wireless Communications (TWC); \copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
♻ ☆ A cGAN Ensemble-based Uncertainty-aware Surrogate Model for Offline Model-based Optimization in Industrial Control Problems IJCNN 2024
This study focuses on two important problems related to applying offline model-based optimization to real-world industrial control problems. The first problem is how to create a reliable probabilistic model that accurately captures the dynamics present in noisy industrial data. The second problem is how to reliably optimize control parameters without actively collecting feedback from industrial systems. Specifically, we introduce a novel cGAN ensemble-based uncertainty-aware surrogate model for reliable offline model-based optimization in industrial control problems. The effectiveness of the proposed method is demonstrated through extensive experiments conducted on two representative cases, namely a discrete control case and a continuous control case. The results of these experiments show that our method outperforms several competitive baselines in the field of offline model-based optimization for industrial control.
comment: Accepted in IJCNN 2024
♻ ☆ 3D Directed Formation Control with Global Shape Convergence using Bispherical Coordinates
In this paper, we present a novel 3D formation control scheme for directed graphs in a leader-follower configuration, achieving (almost) global convergence to the desired shape. Specifically, we introduce three controlled variables representing bispherical coordinates that uniquely describe the formation in 3D. Acyclic triangulated directed graphs (a class of minimally acyclic persistent graphs) are used to model the inter-agent sensing topology, while the agents' dynamics are governed by single-integrator model. Our analysis demonstrates that the proposed decentralized formation controller ensures (almost) global asymptotic stability while avoiding potential shape ambiguities in the final formation. Furthermore, the control laws are implementable in arbitrarily oriented local coordinate frames of follower agents using only low-cost onboard vision sensors, making it suitable for practical applications. Finally, we validate our formation control approach by a simulation study.
comment: Submitted to ECC 2024
Human-Computer Interaction
☆ Perception and Control of Surfing in Virtual Reality using a 6-DoF Motion Platform
The paper presents a system for simulating surfing in Virtual Reality (VR), emphasizing the recreation of aquatic motions and user-initiated propulsive forces using a 6-Degree of Freedom (DoF) motion platform. We present an algorithmic approach to accurately render surfboard kinematics and interactive paddling dynamics, validated through experimental evaluation with \(N=17\) participants. Results indicate that the system effectively reproduces various acceleration levels, the perception of which is independent of users' body posture. We additionally found that the presence of ocean ripples amplifies the perception of acceleration. This system aims to enhance the realism and interactivity of VR surfing, laying a foundation for future advancements in surf therapy and interactive aquatic VR experiences.
☆ Negotiating the Shared Agency between Humans & AI in the Recommender System
Smart recommendation algorithms have revolutionized information dissemination, enhancing efficiency and reshaping content delivery across various domains. However, concerns about user agency have arisen due to the inherent opacity (information asymmetry) and the nature of one-way output (power asymmetry) on algorithms. While both issues have been criticized by scholars via advocating explainable AI (XAI) and human-AI collaborative decision-making (HACD), few research evaluates their integrated effects on users, and few HACD discussions in recommender systems beyond improving and filtering the results. This study proposes an incubating idea as a missing step in HACD that allows users to control the degrees of AI-recommended content. Then, we integrate it with existing XAI to a flow prototype aimed at assessing the enhancement of user agency. We seek to understand how types of agency impact user perception and experience, and bring empirical evidence to refine the guidelines and designs for human-AI interactive systems.
☆ User Experience in Dataset Search Platform Interfaces
This research investigates User Experience (UX) issues in dataset search platform interfaces, targeting Google Dataset Search and data.europa.eu. It focuses on 6 areas within UX: Initial Interaction, Search Process, Dataset Exploration, Filtering and Sorting, Dataset Actions, and Assistance and Feedback. The evaluation method combines 'The Pandemic Puzzle' user task, think-aloud methods, and demographic and post-task questionnaires. 29 strengths and 63 weaknesses were collected from 19 participants involved in roles within technology firm or academia. While certain insights are specific to particular platforms, most are derived from features commonly observed in dataset search platforms across a variety of fields, implying that our findings are broadly applicable. Observations from commonly found features in dataset search platforms across various fields have led to the development of 10 new design prototypes. Unlike literature retrieval, dataset retrieval involves a significant focus on metadata accessibility and quality, each element of which can impact decision-making. To address issues like reading fatigue from metadata presentation, inefficient methods for results searching, filtering, and selection, along with other unresolved user-centric issues on current platforms. These prototypes concentrate on enhancing metadata-related features. They include a redesigned homepage, an improved search bar, better sorting options, an enhanced search result display, a metadata comparison tool, and a navigation guide. Our aim is to improve usability for a wide range of users, including both developers and researchers.
☆ The Impact of Evolutionary Computation on Robotic Design: A Case Study with an Underactuated Hand Exoskeleton ICRA
Robotic exoskeletons can enhance human strength and aid people with physical disabilities. However, designing them to ensure safety and optimal performance presents significant challenges. Developing exoskeletons should incorporate specific optimization algorithms to find the best design. This study investigates the potential of Evolutionary Computation (EC) methods in robotic design optimization, with an underactuated hand exoskeleton (U-HEx) used as a case study. We propose improving the performance and usability of the U-HEx design, which was initially optimized using a naive brute-force approach, by integrating EC techniques such as Genetic Algorithm and Big Bang-Big Crunch Algorithm. Comparative analysis revealed that EC methods consistently yield more precise and optimal solutions than brute force in a significantly shorter time. This allowed us to improve the optimization by increasing the number of variables in the design, which was impossible with naive methods. The results show significant improvements in terms of the torque magnitude the device transfers to the user, enhancing its efficiency. These findings underline the importance of performing proper optimization while designing exoskeletons, as well as providing a significant improvement to this specific robotic design.
comment: 6 pages (+ref), 4 figures, IEEE International Conference on Robotics and Automation (ICRA) 2024
☆ Vid2Real HRI: Align video-based HRI study designs with real-world settings
HRI research using autonomous robots in real-world settings can produce results with the highest ecological validity of any study modality, but many difficulties limit such studies' feasibility and effectiveness. We propose Vid2Real HRI, a research framework to maximize real-world insights offered by video-based studies. The Vid2Real HRI framework was used to design an online study using first-person videos of robots as real-world encounter surrogates. The online study ($n = 385$) distinguished the within-subjects effects of four robot behavioral conditions on perceived social intelligence and human willingness to help the robot enter an exterior door. A real-world, between-subjects replication ($n = 26$) using two conditions confirmed the validity of the online study's findings and the sufficiency of the participant recruitment target ($22$) based on a power analysis of online study results. The Vid2Real HRI framework offers HRI researchers a principled way to take advantage of the efficiency of video-based study modalities while generating directly transferable knowledge of real-world HRI. Code and data from the study are provided at https://vid2real.github.io/vid2realHRI
♻ ☆ Privacy Dashboards for Citizens and corresponding GDPR Services for Small Data Holders: A Literature Review
Citizens have gained many rights with the GDPR, e.g. the right to get a copy of their personal data. In practice, however, this is fraught with problems for citizens and small data holders. We present a literature review on solutions promising relief in the form of privacy dashboards for citizens and GDPR services for small data holders. Covered topics are analyzed, categorized and compared. This is ought to be a step towards both enabling citizens to exercise their GDPR rights and supporting small data holders to comply with their GDPR duties.
comment: 29 pages
♻ ☆ Designing Sousveillance Tools for Gig Workers CHI 2024
As independently-contracted employees, gig workers disproportionately suffer the consequences of workplace surveillance, which include increased pressures to work, breaches of privacy, and decreased digital autonomy. Despite the negative impacts of workplace surveillance, gig workers lack the tools, strategies, and workplace social support to protect themselves against these harms. Meanwhile, some critical theorists have proposed sousveillance as a potential means of countering such abuses of power, whereby those under surveillance monitor those in positions of authority (e.g., gig workers collect data about requesters/platforms). To understand the benefits of sousveillance systems in the gig economy, we conducted semi-structured interviews and led co-design activities with gig workers. We use "care ethics" as a guiding concept to understand our interview and co-design data, while also focusing on empathic sousveillance technology design recommendations. Through our study, we identify gig workers' attitudes towards and past experiences with sousveillance. We also uncover the type of sousveillance technologies imagined by workers, provide design recommendations, and finish by discussing how to create empowering, empathic spaces on gig platforms.
comment: Published as a conference paper at the ACM Conference on Human Factors in Computing Systems, CHI 2024, 3 figures, 30 pages
♻ ☆ Putting Our Minds Together: Iterative Exploration for Collaborative Mind Mapping
We delineate the development of a mind-mapping system designed concurrently for both VR and desktop platforms. Employing an iterative methodology with groups of users, we systematically examined and improved various facets of our system, including interactions, communication mechanisms and gamification elements, to streamline the mind-mapping process while augmenting situational awareness and promoting active engagement among collaborators. We also report our observational findings on these facets from this iterative design process.
comment: Accepted at AHs 2024
Social and Information Networks
☆ Model, Analyze, and Comprehend User Interactions and Various Attributes within a Social Media Platform
How can we effectively model, analyze, and comprehend user interactions and various attributes within a social media platform based on post-comment relationship? In this study, we propose a novel graph-based approach to model and analyze user interactions within a social media platform based on post-comment relationship. We construct a user interaction graph from social media data and analyze it to gain insights into community dynamics, user behavior, and content preferences. Our investigation reveals that while 56.05% of the active users are strongly connected within the community, only 0.8% of them significantly contribute to its dynamics. Moreover, we observe temporal variations in community activity, with certain periods experiencing heightened engagement. Additionally, our findings highlight a correlation between user activity and popularity showing that more active users are generally more popular. Alongside these, a preference for positive and informative content is also observed where 82.41% users preferred positive and informative content. Overall, our study provides a comprehensive framework for understanding and managing online communities, leveraging graph-based techniques to gain valuable insights into user behavior and community dynamics.
comment: 9 Pages, 8 Figures, 3 Tables
☆ #TeamFollowBack: Detection & Analysis of Follow Back Accounts on Social Media
Follow back accounts inflate their follower counts by engaging in reciprocal followings. Such accounts manipulate the public and the algorithms by appearing more popular than they really are. Despite their potential harm, no studies have analyzed such accounts at scale. In this study, we present the first large-scale analysis of follow back accounts. We formally define follow back accounts and employ a honeypot approach to collect a dataset of such accounts on X (formerly Twitter). We discover and describe 12 communities of follow back accounts from 12 different countries, some of which exhibit clear political agenda. We analyze the characteristics of follow back accounts and report that they are newer, more engaging, and have more followings and followers. Finally, we propose a classifier for such accounts and report that models employing profile metadata and the ego network demonstrate promising results, although achieving high recall is challenging. Our study enhances understanding of the follow back accounts and discovering such accounts in the wild.
comment: Accepted to ICWSM24
☆ Optimized Model Selection for Estimating Treatment Effects from Costly Simulations of the US Opioid Epidemic
Agent-based simulation with a synthetic population can help us compare different treatment conditions while keeping everything else constant within the same population (i.e., as digital twins). Such population-scale simulations require large computational power (i.e., CPU resources) to get accurate estimates for treatment effects. We can use meta models of the simulation results to circumvent the need to simulate every treatment condition. Selecting the best estimating model at a given sample size (number of simulation runs) is a crucial problem. Depending on the sample size, the ability of the method to estimate accurately can change significantly. In this paper, we discuss different methods to explore what model works best at a specific sample size. In addition to the empirical results, we provide a mathematical analysis of the MSE equation and how its components decide which model to select and why a specific method behaves that way in a range of sample sizes. The analysis showed why the direction estimation method is better than model-based methods in larger sample sizes and how the between-group variation and the within-group variation affect the MSE equation.
comment: To be presented in 2024 Annual Simulation Conference (ANNSIM'24)
☆ Spatio-Temporal Graph Convolutional Network Combined Large Language Model: A Deep Learning Framework for Bike Demand Forecasting
This study presents a new deep learning framework, combining Spatio-Temporal Graph Convolutional Network (STGCN) with a Large Language Model (LLM), for bike demand forecasting. Addressing challenges in transforming discrete datasets and integrating unstructured language data, the framework leverages LLMs to extract insights from Points of Interest (POI) text data. The proposed STGCN-L model demonstrates competitive performance compared to existing models, showcasing its potential in predicting bike demand. Experiments using Philadelphia datasets highlight the effectiveness of the hybrid model, emphasizing the need for further exploration and enhancements, such as incorporating additional features like weather data for improved accuracy.
comment: ISNN 2024
☆ On the role of network structure in learning to coordinate with bounded rationality
Many socioeconomic phenomena, such as technology adoption, collaborative problem-solving, and content engagement, involve a collection of agents coordinating to take a common action, aligning their decisions to maximize their individual goals. We consider a model for networked interactions where agents learn to coordinate their binary actions under a strict bound on their rationality. We first prove that our model is a potential game and that the optimal action profile is always to achieve perfect alignment at one of the two possible actions, regardless of the network structure. Using a stochastic learning algorithm known as Log Linear Learning, where agents have the same finite rationality parameter, we show that the probability of agents successfully agreeing on the correct decision is monotonically increasing in the number of network links. Therefore, more connectivity improves the accuracy of collective decision-making, as predicted by the phenomenon known as Wisdom of Crowds. Finally, we show that for a fixed number of links, a regular network maximizes the probability of success. We conclude that when using a network of irrational agents, promoting more homogeneous connectivity improves the accuracy of collective decision-making.
comment: Submitted to 2024 IEEE Conference on Decision and Control
♻ ☆ Four universal growth regimes in degree-dependent first passage percolation on spatial random graphs I
One-dependent first passage percolation is a spreading process on a graph where the transmission time through each edge depends on the direct surroundings of the edge. In particular, the classical iid transmission time $L_{xy}$ is multiplied by $(W_xW_y)^\mu$, a polynomial of the expected degrees $W_x, W_y$ of the endpoints of the edge $xy$, which we call the penalty function. Beyond the Markov case, we also allow any distribution for $L_{xy}$ with regularly varying distribution near $0$. We then run this process on three spatial scale-free random graph models: finite and infinite Geometric Inhomogeneous Random Graphs, and Scale-Free Percolation. In these spatial models, the connection probability between two vertices depends on their spatial distance and on their expected degrees. We show that as the penalty-function, i.e., $\mu$ increases, the transmission time between two far away vertices sweeps through four universal phases: explosive (with tight transmission times), polylogarithmic, polynomial but strictly sublinear, and linear in the Euclidean distance. The strictly polynomial growth phase here is a new phenomenon that so far was extremely rare in spatial graph models. The four growth phases are highly robust in the model parameters and are not restricted to phase boundaries. Further, the transition points between the phases depend non-trivially on the main model parameters: the tail of the degree distribution, a long-range parameter governing the presence of long edges, and the behaviour of the distribution $L$ near $0$. In this paper we develop new methods to prove the upper bounds in all sub-explosive phases. Our companion paper complements these results by providing matching lower bounds in the polynomial and linear regimes.
comment: 82 pages. Companion paper: arXiv:2309.11880
♻ ☆ Designing Sousveillance Tools for Gig Workers CHI 2024
As independently-contracted employees, gig workers disproportionately suffer the consequences of workplace surveillance, which include increased pressures to work, breaches of privacy, and decreased digital autonomy. Despite the negative impacts of workplace surveillance, gig workers lack the tools, strategies, and workplace social support to protect themselves against these harms. Meanwhile, some critical theorists have proposed sousveillance as a potential means of countering such abuses of power, whereby those under surveillance monitor those in positions of authority (e.g., gig workers collect data about requesters/platforms). To understand the benefits of sousveillance systems in the gig economy, we conducted semi-structured interviews and led co-design activities with gig workers. We use "care ethics" as a guiding concept to understand our interview and co-design data, while also focusing on empathic sousveillance technology design recommendations. Through our study, we identify gig workers' attitudes towards and past experiences with sousveillance. We also uncover the type of sousveillance technologies imagined by workers, provide design recommendations, and finish by discussing how to create empowering, empathic spaces on gig platforms.
comment: Published as a conference paper at the ACM Conference on Human Factors in Computing Systems, CHI 2024, 3 figures, 30 pages
♻ ☆ Graph Neural Networks can Recover the Hidden Features Solely from the Graph Structure ICML 2023
Graph Neural Networks (GNNs) are popular models for graph learning problems. GNNs show strong empirical performance in many practical tasks. However, the theoretical properties have not been completely elucidated. In this paper, we investigate whether GNNs can exploit the graph structure from the perspective of the expressive power of GNNs. In our analysis, we consider graph generation processes that are controlled by hidden (or latent) node features, which contain all information about the graph structure. A typical example of this framework is kNN graphs constructed from the hidden features. In our main results, we show that GNNs can recover the hidden node features from the input graph alone, even when all node features, including the hidden features themselves and any indirect hints, are unavailable. GNNs can further use the recovered node features for downstream tasks. These results show that GNNs can fully exploit the graph structure by themselves, and in effect, GNNs can use both the hidden and explicit node features for downstream tasks. In the experiments, we confirm the validity of our results by showing that GNNs can accurately recover the hidden features using a GNN architecture built based on our theoretical analysis.
comment: ICML 2023
♻ ☆ Use of mobile phone sensing data to estimate residence and mobility times in urban patches during the COVID-19 epidemic: The case of the 2020 outbreak in Hermosillo, Mexico
It is often necessary to introduce the main characteristics of population mobility dynamics to model critical social phenomena such as the economy, violence, transmission of information, or infectious diseases. In this work, we focus on modeling and inferring urban population mobility using the geospatial data of its inhabitants. The objective is to estimate mobility and times inhabitants spend in the areas of interest, such as zip codes and census geographical areas. The proposed method uses the Brownian bridge model for animal movement in ecology. We illustrate its possible applications using mobile phone GPS data in 2020 from the city of Hermosillo, Sonora, in Mexico. We incorporate the estimated residence-mobility matrix into a multi-patch compartmental SEIR model to assess the effect of mobility changes due to governmental interventions
comment: 28 pages, 7 figures
Distributed, Parallel, and Cluster Computing
☆ LOAM: Low-latency Communication, Caching, and Computation Placement in Data-Intensive Computing Networks
Deploying data- and computation-intensive applications such as large-scale AI into heterogeneous dispersed computing networks can significantly enhance application performance by mitigating bottlenecks caused by limited network resources, including bandwidth, storage, and computing power. However, current resource allocation methods in dispersed computing do not provide a comprehensive solution that considers arbitrary topology, elastic resource amount, reuse of computation results, and nonlinear congestion-dependent optimization objectives. In this paper, we propose LOAM, a low-latency joint communication, caching, and computation placement framework with a rigorous analytical foundation that incorporates the above aspects. We tackle the NP-hard aggregated cost minimization problem with two methods: an offline method with a 1/2 approximation and an online adaptive method with a bounded gap from the optimum. Through extensive simulation, the proposed framework outperforms multiple baselines in both synthesis and real-world network scenarios.
☆ Initialisation and Topology Effects in Decentralised Federated Learning
Fully decentralised federated learning enables collaborative training of individual machine learning models on distributed devices on a network while keeping the training data localised. This approach enhances data privacy and eliminates both the single point of failure and the necessity for central coordination. Our research highlights that the effectiveness of decentralised federated learning is significantly influenced by the network topology of connected devices. A simplified numerical model for studying the early behaviour of these systems leads us to an improved artificial neural network initialisation strategy, which leverages the distribution of eigenvector centralities of the nodes of the underlying network, leading to a radically improved training efficiency. Additionally, our study explores the scaling behaviour and choice of environmental parameters under our proposed initialisation strategy. This work paves the way for more efficient and scalable artificial neural network training in a distributed and uncoordinated environment, offering a deeper understanding of the intertwining roles of network structure and learning dynamics.
☆ Resource-efficient Parallel Split Learning in Heterogeneous Edge Computing
Edge AI has been recently proposed to facilitate the training and deployment of Deep Neural Network (DNN) models in proximity to the sources of data. To enable the training of large models on resource-constraint edge devices and protect data privacy, parallel split learning is becoming a practical and popular approach. However, current parallel split learning neglects the resource heterogeneity of edge devices, which may lead to the straggler issue. In this paper, we propose EdgeSplit, a novel parallel split learning framework to better accelerate distributed model training on heterogeneous and resource-constraint edge devices. EdgeSplit enhances the efficiency of model training on less powerful edge devices by adaptively segmenting the model into varying depths. Our approach focuses on reducing total training time by formulating and solving a task scheduling problem, which determines the most efficient model partition points and bandwidth allocation for each device. We employ a straightforward yet effective alternating algorithm for this purpose. Comprehensive tests conducted with a range of DNN models and datasets demonstrate that EdgeSplit not only facilitates the training of large models on resource-restricted edge devices but also surpasses existing baselines in performance.
comment: Accepted by International Conference on Computing, Networking and Communications (ICNC 2024)
☆ An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning CVPR2024
Heterogeneous Federated Learning (HtFL) enables collaborative learning on multiple clients with different model architectures while preserving privacy. Despite recent research progress, knowledge sharing in HtFL is still difficult due to data and model heterogeneity. To tackle this issue, we leverage the knowledge stored in pre-trained generators and propose a new upload-efficient knowledge transfer scheme called Federated Knowledge-Transfer Loop (FedKTL). Our FedKTL can produce client-task-related prototypical image-vector pairs via the generator's inference on the server. With these pairs, each client can transfer pre-existing knowledge from the generator to its local model through an additional supervised local task. We conduct extensive experiments on four datasets under two types of data heterogeneity with 14 kinds of models including CNNs and ViTs. Results show that our upload-efficient FedKTL surpasses seven state-of-the-art methods by up to 7.31% in accuracy. Moreover, our knowledge transfer scheme is applicable in scenarios with only one edge client. Code: https://github.com/TsingZ0/FedKTL
comment: Accepted by CVPR2024
☆ Radical-Cylon: A Heterogeneous Data Pipeline for Scientific Computing
Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution like Google Pathways with a distributed execution environment for deep learning models exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms are crucial to address these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various high-performance computing platforms, including cloud and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework to execute Cylon as a task of Radical Pilot. We thoroughly explain Radical-Cylon's design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI-communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. Radical-Cylon achieves (4~15)% faster execution time than batch execution while performing similar join and sort operations with 35 million and 3.5 billion rows with the same resources. The approach aims to excel in both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.
comment: 14 pages, 16 figures, 2 tables
☆ Navigating the Landscape of Distributed File Systems: Architectures, Implementations, and Considerations
Distributed File Systems (DFS) have emerged as sophisticated solutions for efficient file storage and management across interconnected computer nodes. The main objective of DFS is to achieve flexible, scalable, and resilient file storage management by dispersing file data across multiple interconnected computer nodes, enabling users to seamlessly access and manipulate files distributed across diverse nodes. This article provides an overview of DFS, its architecture, classification methods, design considerations, challenges, and common implementations. Common DFS implementations discussed include NFS, AFS, GFS, HDFS, and CephFS, each tailored to specific use cases and design goals. Understanding the nuances of DFS architecture, classification, and design considerations is crucial for developing efficient, stable, and secure distributed file systems to meet diverse user and application needs in modern computing environments.
☆ Improved Methods of Task Assignment and Resource Allocation with Preemption in Edge Computing Systems
Edge computing has become a very popular service that enables mobile devices to run complex tasks with the help of network-based computing resources. However, edge clouds are often resource-constrained, which makes resource allocation a challenging issue. In addition, edge cloud servers must make allocation decisions with only limited information available, since the arrival of future client tasks might be impossible to predict, and the states and behavior of neighboring servers might be obscured. We focus on a distributed resource allocation method in which servers operate independently and do not communicate with each other, but interact with clients (tasks) to make allocation decisions. We follow a two-round bidding approach to assign tasks to edge cloud servers, and servers are allowed to preempt previous tasks to allocate more useful ones. We evaluate the performance of our system using realistic simulations and real-world trace data from a high-performance computing cluster. Results show that our heuristic improves system-wide performance by $20-25\%$ over previous work when accounting for the time taken by each approach. In this way, an ideal trade-off between performance and speed is achieved.
comment: 13 pages
☆ TablePuppet: A Generic Framework for Relational Federated Learning
Current federated learning (FL) approaches view decentralized training data as a single table, divided among participants either horizontally (by rows) or vertically (by columns). However, these approaches are inadequate for handling distributed relational tables across databases. This scenario requires intricate SQL operations like joins and unions to obtain the training data, which is either costly or restricted by privacy concerns. This raises the question: can we directly run FL on distributed relational tables? In this paper, we formalize this problem as relational federated learning (RFL). We propose TablePuppet, a generic framework for RFL that decomposes the learning process into two steps: (1) learning over join (LoJ) followed by (2) learning over union (LoU). In a nutshell, LoJ pushes learning down onto the vertical tables being joined, and LoU further pushes learning down onto the horizontal partitions of each vertical table. TablePuppet incorporates computation/communication optimizations to deal with the duplicate tuples introduced by joins, as well as differential privacy (DP) to protect against both feature and label leakages. We demonstrate the efficiency of TablePuppet in combination with two widely-used ML training algorithms, stochastic gradient descent (SGD) and alternating direction method of multipliers (ADMM), and compare their computation/communication complexity. We evaluate the SGD/ADMM algorithms developed atop TablePuppet by training diverse ML models. Our experimental results show that TablePuppet achieves model accuracy comparable to the centralized baselines running directly atop the SQL results. Moreover, ADMM takes less communication time than SGD to converge to similar model accuracy.
comment: 14 pages, 8 figures
☆ Ev-Edge: Efficient Execution of Event-based Vision Algorithms on Commodity Edge Platforms
Event cameras have emerged as a promising sensing modality for autonomous navigation systems, owing to their high temporal resolution, high dynamic range and negligible motion blur. To process the asynchronous temporal event streams from such sensors, recent research has shown that a mix of Artificial Neural Networks (ANNs), Spiking Neural Networks (SNNs) as well as hybrid SNN-ANN algorithms are necessary to achieve high accuracies across a range of perception tasks. However, we observe that executing such workloads on commodity edge platforms which feature heterogeneous processing elements such as CPUs, GPUs and neural accelerators results in inferior performance. This is due to the mismatch between the irregular nature of event streams and diverse characteristics of algorithms on the one hand and the underlying hardware platform on the other. We propose Ev-Edge, a framework that contains three key optimizations to boost the performance of event-based vision systems on edge platforms: (1) An Event2Sparse Frame converter directly transforms raw event streams into sparse frames, enabling the use of sparse libraries with minimal encoding overheads (2) A Dynamic Sparse Frame Aggregator merges sparse frames at runtime by trading off the temporal granularity of events and computational demand thereby improving hardware utilization (3) A Network Mapper maps concurrently executing tasks to different processing elements while also selecting layer precision by considering both compute and communication overheads. On several state-of-art networks for a range of autonomous navigation tasks, Ev-Edge achieves 1.28x-2.05x improvements in latency and 1.23x-2.15x in energy over an all-GPU implementation on the NVIDIA Jetson Xavier AGX platform for single-task execution scenarios. Ev-Edge also achieves 1.43x-1.81x latency improvements over round-robin scheduling methods in multi-task execution scenarios.
♻ ☆ FrankenSplit: Efficient Neural Feature Compression with Shallow Variational Bottleneck Injection for Mobile Edge Computing
The rise of mobile AI accelerators allows latency-sensitive applications to execute lightweight Deep Neural Networks (DNNs) on the client side. However, critical applications require powerful models that edge devices cannot host and must therefore offload requests, where the high-dimensional data will compete for limited bandwidth. This work proposes shifting away from focusing on executing shallow layers of partitioned DNNs. Instead, it advocates concentrating the local resources on variational compression optimized for machine interpretability. We introduce a novel framework for resource-conscious compression models and extensively evaluate our method in an environment reflecting the asymmetric resource distribution between edge devices and servers. Our method achieves 60% lower bitrate than a state-of-the-art SC method without decreasing accuracy and is up to 16x faster than offloading with existing codec standards.
comment: Submission to IEEE Transactions on Mobile Computing
♻ ☆ A Survey for Federated Learning Evaluations: Goals and Measures
Evaluation is a systematic approach to assessing how well a system achieves its intended purpose. Federated learning (FL) is a novel paradigm for privacy-preserving machine learning that allows multiple parties to collaboratively train models without sharing sensitive data. However, evaluating FL is challenging due to its interdisciplinary nature and diverse goals, such as utility, efficiency, and security. In this survey, we first review the major evaluation goals adopted in the existing studies and then explore the evaluation metrics used for each goal. We also introduce FedEval, an open-source platform that provides a standardized and comprehensive evaluation framework for FL algorithms in terms of their utility, efficiency, and security. Finally, we discuss several challenges and future research directions for FL evaluation.
♻ ☆ A learning-based solution approach to the application placement problem in mobile edge computing under uncertainty
Placing applications in mobile edge computing servers presents a complex challenge involving many servers, users, and their requests. Existing algorithms take a long time to solve high-dimensional problems with significant uncertainty scenarios. Therefore, an efficient approach is required to maximize the quality of service while considering all technical constraints. One of these approaches is machine learning, which emulates optimal solutions for application placement in edge servers. Machine learning models are expected to learn how to allocate user requests to servers based on the spatial positions of users and servers. In this study, the problem is formulated as a two-stage stochastic programming. A sufficient amount of training records is generated by varying parameters such as user locations, their request rates, and solving the optimization model. Then, based on the distance features of each user from the available servers and their request rates, machine learning models generate decision variables for the first stage of the stochastic optimization model, which is the user-to-server request allocation, and are employed as independent decision agents that reliably mimic the optimization model. Support Vector Machines (SVM) and Multi-layer Perceptron (MLP) are used in this research to achieve practical decisions from the stochastic optimization models. The performance of each model has shown an execution effectiveness of over 80%. This research aims to provide a more efficient approach for tackling high-dimensional problems and scenarios with uncertainties in mobile edge computing by leveraging machine learning models for optimal decision-making in request allocation to edge servers. These results suggest that machine-learning models can significantly improve solution times compared to conventional approaches.
Cryptography and Security
☆ UPSS: a User-centric Private Storage System with its applications
Strong confidentiality, integrity, user control, reliability and performance are critical requirements in privacy-sensitive applications. Such applications would benefit from a data storage and sharing infrastructure that provides these properties even in decentralized topologies with untrusted storage backends, but users today are forced to choose between systemic security properties and system reliability or performance. As an alternative to this status quo we present UPSS: the user-centric private sharing system, a cryptographic storage system that can be used as a conventional filesystem or as the foundation for security-sensitive applications such as redaction with integrity and private revision control. We demonstrate that both the security and performance properties of UPSS exceed that of existing cryptographic filesystems and that its performance is comparable to mature conventional filesystems - in some cases, even superior. Whether used directly via its Rust API or as a conventional filesystem, UPSS provides strong security and practical performance on untrusted storage.
☆ User-Side Realization
Users are dissatisfied with services. Since the service is not tailor-made for a user, it is natural for dissatisfaction to arise. The problem is, that even if users are dissatisfied, they often do not have the means to resolve their dissatisfaction. The user cannot alter the source code of the service, nor can they force the service provider to change. The user has no choice but to remain dissatisfied or quit the service. User-side realization offers proactive solutions to this problem by providing general algorithms to deal with common problems on the user's side. These algorithms run on the user's side and solve the problems without having the service provider change the service itself.
comment: Doctoral Thesis
☆ Leveraging Large Language Models for Preliminary Security Risk Analysis: A Mission-Critical Case Study
Preliminary security risk analysis (PSRA) provides a quick approach to identify, evaluate and propose remeditation to potential risks in specific scenarios. The extensive expertise required for an effective PSRA and the substantial ammount of textual-related tasks hinder quick assessments in mission-critical contexts, where timely and prompt actions are essential. The speed and accuracy of human experts in PSRA significantly impact response time. A large language model can quickly summarise information in less time than a human. To our knowledge, no prior study has explored the capabilities of fine-tuned models (FTM) in PSRA. Our case study investigates the proficiency of FTM to assist practitioners in PSRA. We manually curated 141 representative samples from over 50 mission-critical analyses archived by the industrial context team in the last five years.We compared the proficiency of the FTM versus seven human experts. Within the industrial context, our approach has proven successful in reducing errors in PSRA, hastening security risk detection, and minimizing false positives and negatives. This translates to cost savings for the company by averting unnecessary expenses associated with implementing unwarranted countermeasures. Therefore, experts can focus on more comprehensive risk analysis, leveraging LLMs for an effective preliminary assessment within a condensed timeframe.
☆ Ghost Sentence: A Tool for Everyday Users to Copyright Data from Large Language Models
Web user data plays a central role in the ecosystem of pre-trained large language models (LLMs) and their fine-tuned variants. Billions of data are crawled from the web and fed to LLMs. How can \textit{\textbf{everyday web users}} confirm if LLMs misuse their data without permission? In this work, we suggest that users repeatedly insert personal passphrases into their documents, enabling LLMs to memorize them. These concealed passphrases in user documents, referred to as \textit{ghost sentences}, once they are identified in the generated content of LLMs, users can be sure that their data is used for training. To explore the effectiveness and usage of this copyrighting tool, we define the \textit{user training data identification} task with ghost sentences. Multiple datasets from various sources at different scales are created and tested with LLMs of different sizes. For evaluation, we introduce a last $k$ words verification manner along with two metrics: document and user identification accuracy. In the specific case of instruction tuning of a 3B LLaMA model, 11 out of 16 users with ghost sentences identify their data within the generation content. These 16 users contribute 383 examples to $\sim$1.8M training documents. For continuing pre-training of a 1.1B TinyLlama model, 61 out of 64 users with ghost sentences identify their data within the LLM output. These 64 users contribute 1156 examples to $\sim$10M training documents.
comment: Preprint, work in progress
☆ A hybrid LLM workflow can help identify user privilege related variables in programs of any size
Many programs involves operations and logic manipulating user privileges, which is essential for the security of an organization. Therefore, one common malicious goal of attackers is to obtain or escalate the privileges, causing privilege leakage. To protect the program and the organization against privilege leakage attacks, it is important to eliminate the vulnerabilities which can be exploited to achieve such attacks. Unfortunately, while memory vulnerabilities are less challenging to find, logic vulnerabilities are much more imminent, harmful and difficult to identify. Accordingly, many analysts choose to find user privilege related (UPR) variables first as start points to investigate the code where the UPR variables may be used to see if there exists any vulnerabilities, especially the logic ones. In this paper, we introduce a large language model (LLM) workflow that can assist analysts in identifying such UPR variables, which is considered to be a very time-consuming task. Specifically, our tool will audit all the variables in a program and output a UPR score, which is the degree of relationship (closeness) between the variable and user privileges, for each variable. The proposed approach avoids the drawbacks introduced by directly prompting a LLM to find UPR variables by focusing on leverage the LLM at statement level instead of supplying LLM with very long code snippets. Those variables with high UPR scores are essentially potential UPR variables, which should be manually investigated. Our experiments show that using a typical UPR score threshold (i.e., UPR score >0.8), the false positive rate (FPR) is only 13.49%, while UPR variable found is significantly more than that of the heuristic based method.
☆ AC4: Algebraic Computation Checker for Circuit Constraints in ZKPs
ZKP systems have surged attention and held a fundamental role in contemporary cryptography. Zk-SNARK protocols dominate the ZKP usage, often implemented through arithmetic circuit programming paradigm. However, underconstrained or overconstrained circuits may lead to bugs. Underconstrained circuits refer to circuits that lack the necessary constraints, resulting in unexpected solutions in the circuit and causing the verifier to accept a bogus witness. Overconstrained circuits refer to circuits that are constrained excessively, resulting in the circuit lacking necessary solutions and causing the verifier to accept no witness, rendering the circuit meaningless. This paper introduces a novel approach for pinpointing two distinct types of bugs in ZKP circuits. The method involves encoding the arithmetic circuit constraints to polynomial equation systems and solving polynomial equation systems over a finite field by algebraic computation. The classification of verification results is refined, greatly enhancing the expressive power of the system. We proposed a tool, AC4, to represent the implementation of this method. Experiments demonstrate that AC4 represents a substantial 29% increase in the checked ratio compared to prior work. Within a solvable range, the checking time of AC4 has also exhibited noticeable improvement, demonstrating a magnitude increase compared to previous efforts.
comment: 20 pages, 4 figures
♻ ☆ Privacy Dashboards for Citizens and corresponding GDPR Services for Small Data Holders: A Literature Review
Citizens have gained many rights with the GDPR, e.g. the right to get a copy of their personal data. In practice, however, this is fraught with problems for citizens and small data holders. We present a literature review on solutions promising relief in the form of privacy dashboards for citizens and GDPR services for small data holders. Covered topics are analyzed, categorized and compared. This is ought to be a step towards both enabling citizens to exercise their GDPR rights and supporting small data holders to comply with their GDPR duties.
comment: 29 pages
♻ ☆ Effect of Ambient-Intrinsic Dimension Gap on Adversarial Vulnerability AISTATS 2024
The existence of adversarial attacks on machine learning models imperceptible to a human is still quite a mystery from a theoretical perspective. In this work, we introduce two notions of adversarial attacks: natural or on-manifold attacks, which are perceptible by a human/oracle, and unnatural or off-manifold attacks, which are not. We argue that the existence of the off-manifold attacks is a natural consequence of the dimension gap between the intrinsic and ambient dimensions of the data. For 2-layer ReLU networks, we prove that even though the dimension gap does not affect generalization performance on samples drawn from the observed data space, it makes the clean-trained model more vulnerable to adversarial perturbations in the off-manifold direction of the data space. Our main results provide an explicit relationship between the $\ell_2,\ell_{\infty}$ attack strength of the on/off-manifold attack and the dimension gap.
comment: AISTATS 2024
♻ ☆ A Survey for Federated Learning Evaluations: Goals and Measures
Evaluation is a systematic approach to assessing how well a system achieves its intended purpose. Federated learning (FL) is a novel paradigm for privacy-preserving machine learning that allows multiple parties to collaboratively train models without sharing sensitive data. However, evaluating FL is challenging due to its interdisciplinary nature and diverse goals, such as utility, efficiency, and security. In this survey, we first review the major evaluation goals adopted in the existing studies and then explore the evaluation metrics used for each goal. We also introduce FedEval, an open-source platform that provides a standardized and comprehensive evaluation framework for FL algorithms in terms of their utility, efficiency, and security. Finally, we discuss several challenges and future research directions for FL evaluation.
♻ ☆ Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
Large Language Models (LLMs) have become a cornerstone in the field of Natural Language Processing (NLP), offering transformative capabilities in understanding and generating human-like text. However, with their rising prominence, the security and vulnerability aspects of these models have garnered significant attention. This paper presents a comprehensive survey of the various forms of attacks targeting LLMs, discussing the nature and mechanisms of these attacks, their potential impacts, and current defense strategies. We delve into topics such as adversarial attacks that aim to manipulate model outputs, data poisoning that affects model training, and privacy concerns related to training data exploitation. The paper also explores the effectiveness of different attack methodologies, the resilience of LLMs against these attacks, and the implications for model integrity and user trust. By examining the latest research, we provide insights into the current landscape of LLM vulnerabilities and defense mechanisms. Our objective is to offer a nuanced understanding of LLM attacks, foster awareness within the AI community, and inspire robust solutions to mitigate these risks in future developments.
♻ ☆ Private measures, random walks, and synthetic data
Differential privacy is a mathematical concept that provides an information-theoretic security guarantee. While differential privacy has emerged as a de facto standard for guaranteeing privacy in data sharing, the known mechanisms to achieve it come with some serious limitations. Utility guarantees are usually provided only for a fixed, a priori specified set of queries. Moreover, there are no utility guarantees for more complex - but very common - machine learning tasks such as clustering or classification. In this paper we overcome some of these limitations. Working with metric privacy, a powerful generalization of differential privacy, we develop a polynomial-time algorithm that creates a private measure from a data set. This private measure allows us to efficiently construct private synthetic data that are accurate for a wide range of statistical analysis tools. Moreover, we prove an asymptotically sharp min-max result for private measures and synthetic data for general compact metric spaces. A key ingredient in our construction is a new superregular random walk, whose joint distribution of steps is as regular as that of independent random variables, yet which deviates from the origin logarithmicaly slowly.
♻ ☆ On the Privacy Effect of Data Enhancement via the Lens of Memorization
Machine learning poses severe privacy concerns as it has been shown that the learned models can reveal sensitive information about their training data. Many works have investigated the effect of widely adopted data augmentation and adversarial training techniques, termed data enhancement in the paper, on the privacy leakage of machine learning models. Such privacy effects are often measured by membership inference attacks (MIAs), which aim to identify whether a particular example belongs to the training set or not. We propose to investigate privacy from a new perspective called memorization. Through the lens of memorization, we find that previously deployed MIAs produce misleading results as they are less likely to identify samples with higher privacy risks as members compared to samples with low privacy risks. To solve this problem, we deploy a recent attack that can capture individual samples' memorization degrees for evaluation. Through extensive experiments, we unveil several findings about the connections between three essential properties of machine learning models, including privacy, generalization gap, and adversarial robustness. We demonstrate that the generalization gap and privacy leakage are less correlated than those of the previous results. Moreover, there is not necessarily a trade-off between adversarial robustness and privacy as stronger adversarial robustness does not make the model more susceptible to privacy attacks.
comment: Accepted by IEEE TIFS, 17 pages
Computation and Language
☆ IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models
The advent of Vision Language Models (VLM) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks - comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of GeminiPro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.
☆ Geotokens and Geotransformers
In transformer architectures, position encoding primarily provides a sense of sequence for input tokens. While the original transformer paper's method has shown satisfactory results in general language processing tasks, there have been new proposals, such as Rotary Position Embedding (RoPE), for further improvement. This paper presents geotokens, input components for transformers, each linked to a specific geological location. Unlike typical language sequences, for these tokens, the order is not as vital as the geographical coordinates themselves. To represent the relative position in this context and to keep a balance between the real world distance and the distance in the embedding space, we design a position encoding approach drawing from the RoPE structure but tailored for spherical coordinates.
☆ LlamBERT: Large-scale low-cost data annotation in NLP
Large Language Models (LLMs), such as GPT-4 and Llama 2, show remarkable proficiency in a wide range of natural language processing (NLP) tasks. Despite their effectiveness, the high costs associated with their use pose a challenge. We present LlamBERT, a hybrid approach that leverages LLMs to annotate a small subset of large, unlabeled databases and uses the results for fine-tuning transformer encoders like BERT and RoBERTa. This strategy is evaluated on two diverse datasets: the IMDb review dataset and the UMLS Meta-Thesaurus. Our results indicate that the LlamBERT approach slightly compromises on accuracy while offering much greater cost-effectiveness.
comment: 11 pages, 1 figure
☆ Leveraging Zero-Shot Prompting for Efficient Language Model Distillation
This paper introduces a novel approach for efficiently distilling LLMs into smaller, application-specific models, significantly reducing operational costs and manual labor. Addressing the challenge of deploying computationally intensive LLMs in specific applications or edge devices, this technique utilizes LLMs' reasoning capabilities to generate labels and natural language rationales for unlabeled data. Our approach enhances both finetuning and distillation by employing a multi-task training framework where student models mimic these rationales alongside teacher predictions. Key contributions include the employment of zero-shot prompting to elicit teacher model rationales, reducing the necessity for handcrafted few-shot examples and lowering the overall token count required, which directly translates to cost savings given the pay-per-token billing model of major tech companies' LLM APIs. Additionally, the paper investigates the impact of explanation properties on distillation efficiency, demonstrating that minimal performance loss occurs even when rationale augmentation is not applied across the entire dataset, facilitating further reductions of tokens. This research marks a step toward the efficient training of task-specific models with minimal human intervention, offering substantial cost-savings while maintaining, or even enhancing, performance.
☆ STEntConv: Predicting Disagreement with Stance Detection and a Signed Graph Convolutional Network LREC
The rise of social media platforms has led to an increase in polarised online discussions, especially on political and socio-cultural topics such as elections and climate change. We propose a simple and novel unsupervised method to predict whether the authors of two posts agree or disagree, leveraging user stances about named entities obtained from their posts. We present STEntConv, a model which builds a graph of users and named entities weighted by stance and trains a Signed Graph Convolutional Network (SGCN) to detect disagreement between comment and reply posts. We run experiments and ablation studies and show that including this information improves disagreement detection performance on a dataset of Reddit posts for a range of controversial subreddit topics, without the need for platform-specific features or user history.
comment: Accepted for the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
☆ VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding NAACL 2024
The success of Natural Language Understanding (NLU) benchmarks in various languages, such as GLUE for English, CLUE for Chinese, KLUE for Korean, and IndoNLU for Indonesian, has facilitated the evaluation of new NLU models across a wide range of tasks. To establish a standardized set of benchmarks for Vietnamese NLU, we introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchmark. The VLUE benchmark encompasses five datasets covering different NLU tasks, including text classification, span extraction, and natural language understanding. To provide an insightful overview of the current state of Vietnamese NLU, we then evaluate seven state-of-the-art pre-trained models, including both multilingual and Vietnamese monolingual models, on our proposed VLUE benchmark. Furthermore, we present CafeBERT, a new state-of-the-art pre-trained model that achieves superior results across all tasks in the VLUE benchmark. Our model combines the proficiency of a multilingual pre-trained model with Vietnamese linguistic knowledge. CafeBERT is developed based on the XLM-RoBERTa model, with an additional pretraining step utilizing a significant amount of Vietnamese textual data to enhance its adaptation to the Vietnamese language. For the purpose of future research, CafeBERT is made publicly available for research purposes.
comment: Accepted at NAACL 2024 (Findings)
☆ LAMPER: LanguAge Model and Prompt EngineeRing for zero-shot time series classification ICLR 2024
This study constructs the LanguAge Model with Prompt EngineeRing (LAMPER) framework, designed to systematically evaluate the adaptability of pre-trained language models (PLMs) in accommodating diverse prompts and their integration in zero-shot time series (TS) classification. We deploy LAMPER in experimental assessments using 128 univariate TS datasets sourced from the UCR archive. Our findings indicate that the feature representation capacity of LAMPER is influenced by the maximum input token threshold imposed by PLMs.
comment: Accepted as tiny paper in ICLR 2024
☆ RAAMove: A Corpus for Analyzing Moves in Research Article Abstracts LREC
Move structures have been studied in English for Specific Purposes (ESP) and English for Academic Purposes (EAP) for decades. However, there are few move annotation corpora for Research Article (RA) abstracts. In this paper, we introduce RAAMove, a comprehensive multi-domain corpus dedicated to the annotation of move structures in RA abstracts. The primary objective of RAAMove is to facilitate move analysis and automatic move identification. This paper provides a thorough discussion of the corpus construction process, including the scheme, data collection, annotation guidelines, and annotation procedures. The corpus is constructed through two stages: initially, expert annotators manually annotate high-quality data; subsequently, based on the human-annotated data, a BERT-based model is employed for automatic annotation with the help of experts' modification. The result is a large-scale and high-quality corpus comprising 33,988 annotated instances. We also conduct preliminary move identification experiments using the BERT-based model to verify the effectiveness of the proposed corpus and model. The annotated corpus is available for academic research purposes and can serve as essential resources for move analysis, English language teaching and writing, as well as move/discourse-related tasks in Natural Language Processing (NLP).
comment: Accepted by LREC-COLING 2024
☆ Centered Masking for Language-Image Pre-Training
We introduce Gaussian masking for Language-Image Pre-Training (GLIP) a novel, straightforward, and effective technique for masking image patches during pre-training of a vision-language model. GLIP builds on Fast Language-Image Pre-Training (FLIP), which randomly masks image patches while training a CLIP model. GLIP replaces random masking with centered masking, that uses a Gaussian distribution and is inspired by the importance of image patches at the center of the image. GLIP retains the same computational savings as FLIP, while improving performance across a range of downstream datasets and tasks, as demonstrated by our experimental results. We show the benefits of GLIP to be easy to obtain, requiring no delicate tuning of the Gaussian, and also applicable to data sets containing images without an obvious center focus.
☆ Computational Sentence-level Metrics Predicting Human Sentence Comprehension
The majority of research in computational psycholinguistics has concentrated on the processing of words. This study introduces innovative methods for computing sentence-level metrics using multilingual large language models. The metrics developed sentence surprisal and sentence relevance and then are tested and compared to validate whether they can predict how humans comprehend sentences as a whole across languages. These metrics offer significant interpretability and achieve high accuracy in predicting human sentence reading speeds. Our results indicate that these computational sentence-level metrics are exceptionally effective at predicting and elucidating the processing difficulties encountered by readers in comprehending sentences as a whole across a variety of languages. Their impressive performance and generalization capabilities provide a promising avenue for future research in integrating LLMs and cognitive science.
☆ MRC-based Nested Medical NER with Co-prediction and Adaptive Pre-training
In medical information extraction, medical Named Entity Recognition (NER) is indispensable, playing a crucial role in developing medical knowledge graphs, enhancing medical question-answering systems, and analyzing electronic medical records. The challenge in medical NER arises from the complex nested structures and sophisticated medical terminologies, distinguishing it from its counterparts in traditional domains. In response to these complexities, we propose a medical NER model based on Machine Reading Comprehension (MRC), which uses a task-adaptive pre-training strategy to improve the model's capability in the medical field. Meanwhile, our model introduces multiple word-pair embeddings and multi-granularity dilated convolution to enhance the model's representation ability and uses a combined predictor of Biaffine and MLP to improve the model's recognition performance. Experimental evaluations conducted on the CMeEE, a benchmark for Chinese nested medical NER, demonstrate that our proposed model outperforms the compared state-of-the-art (SOTA) models.
☆ Understanding Emergent Abilities of Language Models from the Loss Perspective
Recent studies have put into question the belief that emergent abilities in language models are exclusive to large models. This skepticism arises from two observations: 1) smaller models can also exhibit high performance on emergent abilities and 2) there is doubt on the discontinuous metrics used to measure these abilities. In this paper, we propose to study emergent abilities in the lens of pre-training loss, instead of model size or training compute. We demonstrate that the models with the same pre-training loss, but different model and data sizes, generate the same performance on various downstream tasks. We also discover that a model exhibits emergent abilities on certain tasks -- regardless of the continuity of metrics -- when its pre-training loss falls below a specific threshold. Before reaching this threshold, its performance remains at the level of random guessing. This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses, highlighting that these abilities cannot be predicted by merely extrapolating the performance trends of models with higher pre-training losses.
comment: 18 pages, 6 figures
☆ Modeling Unified Semantic Discourse Structure for High-quality Headline Generation
Headline generation aims to summarize a long document with a short, catchy title that reflects the main idea. This requires accurately capturing the core document semantics, which is challenging due to the lengthy and background information-rich na ture of the texts. In this work, We propose using a unified semantic discourse structure (S3) to represent document semantics, achieved by combining document-level rhetorical structure theory (RST) trees with sentence-level abstract meaning representation (AMR) graphs to construct S3 graphs. The hierarchical composition of sentence, clause, and word intrinsically characterizes the semantic meaning of the overall document. We then develop a headline generation framework, in which the S3 graphs are encoded as contextual features. To consolidate the efficacy of S3 graphs, we further devise a hierarchical structure pruning mechanism to dynamically screen the redundant and nonessential nodes within the graph. Experimental results on two headline generation datasets demonstrate that our method outperforms existing state-of-art methods consistently. Our work can be instructive for a broad range of document modeling tasks, more than headline or summarization generation.
☆ User-Side Realization
Users are dissatisfied with services. Since the service is not tailor-made for a user, it is natural for dissatisfaction to arise. The problem is, that even if users are dissatisfied, they often do not have the means to resolve their dissatisfaction. The user cannot alter the source code of the service, nor can they force the service provider to change. The user has no choice but to remain dissatisfied or quit the service. User-side realization offers proactive solutions to this problem by providing general algorithms to deal with common problems on the user's side. These algorithms run on the user's side and solve the problems without having the service provider change the service itself.
comment: Doctoral Thesis
☆ Leveraging Large Language Models for Preliminary Security Risk Analysis: A Mission-Critical Case Study
Preliminary security risk analysis (PSRA) provides a quick approach to identify, evaluate and propose remeditation to potential risks in specific scenarios. The extensive expertise required for an effective PSRA and the substantial ammount of textual-related tasks hinder quick assessments in mission-critical contexts, where timely and prompt actions are essential. The speed and accuracy of human experts in PSRA significantly impact response time. A large language model can quickly summarise information in less time than a human. To our knowledge, no prior study has explored the capabilities of fine-tuned models (FTM) in PSRA. Our case study investigates the proficiency of FTM to assist practitioners in PSRA. We manually curated 141 representative samples from over 50 mission-critical analyses archived by the industrial context team in the last five years.We compared the proficiency of the FTM versus seven human experts. Within the industrial context, our approach has proven successful in reducing errors in PSRA, hastening security risk detection, and minimizing false positives and negatives. This translates to cost savings for the company by averting unnecessary expenses associated with implementing unwarranted countermeasures. Therefore, experts can focus on more comprehensive risk analysis, leveraging LLMs for an effective preliminary assessment within a condensed timeframe.
☆ On the Fragility of Active Learners
Active learning (AL) techniques aim to maximally utilize a labeling budget by iteratively selecting instances that are most likely to improve prediction accuracy. However, their benefit compared to random sampling has not been consistent across various setups, e.g., different datasets, classifiers. In this empirical study, we examine how a combination of different factors might obscure any gains from an AL technique. Focusing on text classification, we rigorously evaluate AL techniques over around 1000 experiments that vary wrt the dataset, batch size, text representation and the classifier. We show that AL is only effective in a narrow set of circumstances. We also address the problem of using metrics that are better aligned with real world expectations. The impact of this study is in its insights for a practitioner: (a) the choice of text representation and classifier is as important as that of an AL technique, (b) choice of the right metric is critical in assessment of the latter, and, finally, (c) reported AL results must be holistically interpreted, accounting for variables other than just the query strategy.
☆ Ghost Sentence: A Tool for Everyday Users to Copyright Data from Large Language Models
Web user data plays a central role in the ecosystem of pre-trained large language models (LLMs) and their fine-tuned variants. Billions of data are crawled from the web and fed to LLMs. How can \textit{\textbf{everyday web users}} confirm if LLMs misuse their data without permission? In this work, we suggest that users repeatedly insert personal passphrases into their documents, enabling LLMs to memorize them. These concealed passphrases in user documents, referred to as \textit{ghost sentences}, once they are identified in the generated content of LLMs, users can be sure that their data is used for training. To explore the effectiveness and usage of this copyrighting tool, we define the \textit{user training data identification} task with ghost sentences. Multiple datasets from various sources at different scales are created and tested with LLMs of different sizes. For evaluation, we introduce a last $k$ words verification manner along with two metrics: document and user identification accuracy. In the specific case of instruction tuning of a 3B LLaMA model, 11 out of 16 users with ghost sentences identify their data within the generation content. These 16 users contribute 383 examples to $\sim$1.8M training documents. For continuing pre-training of a 1.1B TinyLlama model, 61 out of 64 users with ghost sentences identify their data within the LLM output. These 64 users contribute 1156 examples to $\sim$10M training documents.
comment: Preprint, work in progress
☆ Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning
We consider the task of building a dialogue system that can motivate users to adopt positive lifestyle changes: Motivational Interviewing. Addressing such a task requires a system that can infer \textit{how} to motivate a user effectively. We propose DIIT, a framework that is capable of learning and applying conversation strategies in the form of natural language inductive rules from expert demonstrations. Automatic and human evaluation on instruction-following large language models show natural language strategy descriptions discovered by DIIR can improve active listening skills, reduce unsolicited advice, and promote more collaborative and less authoritative responses, outperforming various demonstration utilization methods.
☆ LLMs Instruct LLMs:An Extraction and Editing Method
The interest in updating Large Language Models (LLMs) without retraining from scratch is substantial, yet it comes with some challenges.This is especially true for situations demanding complex reasoning with limited samples, a scenario we refer to as the Paucity-Constrained Complex Reasoning Adaptation for LLMs (PCRA-LLM).Traditional methods like Low-Rank Adaptation (LoRA) and Retrieval-Augmented Generation (RAG) are inadequate for this critical issue, particularly evident in our exploration of a specific medical context that epitomize the PCRA-LLM's distinct needs.To address the issue, we propose a Sequential Fusion method to incorporate knowledge from complex context into LLMs. This method employs a two-stage framework: initially, it leverages general LLMs to construct knowledge graphs (KGs) for extracting knowledge from complex texts; subsequently, it updates the domain LLMs through knowledge edit. According to our method, the domain LLM achieved a 71.69\% accuracy in question answering tasks. Subsequently, we broadened our assessment to a novel dataset we developed in the economics and management field, where our method realized a 75\% accuracy. These outcomes underline the efficacy and adaptability of our approach for PCRA-LLM across various domains.
comment: Working in progress
☆ Towards a \textbf{RAG}-based Summarization Agent for the Electron-Ion Collider
The complexity and sheer volume of information encompassing documents, papers, data, and other resources from large-scale experiments demand significant time and effort to navigate, making the task of accessing and utilizing these varied forms of information daunting, particularly for new collaborators and early-career scientists. To tackle this issue, a Retrieval Augmented Generation (RAG)--based Summarization AI for EIC (RAGS4EIC) is under development. This AI-Agent not only condenses information but also effectively references relevant responses, offering substantial advantages for collaborators. Our project involves a two-step approach: first, querying a comprehensive vector database containing all pertinent experiment information; second, utilizing a Large Language Model (LLM) to generate concise summaries enriched with citations based on user queries and retrieved data. We describe the evaluation methods that use RAG assessments (RAGAs) scoring mechanisms to assess the effectiveness of responses. Furthermore, we describe the concept of prompt template-based instruction-tuning which provides flexibility and accuracy in summarization. Importantly, the implementation relies on LangChain, which serves as the foundation of our entire workflow. This integration ensures efficiency and scalability, facilitating smooth deployment and accessibility for various user groups within the Electron Ion Collider (EIC) community. This innovative AI-driven framework not only simplifies the understanding of vast datasets but also encourages collaborative participation, thereby empowering researchers. As a demonstration, a web application has been developed to explain each stage of the RAG Agent development in detail.
☆ PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents LREC
Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprises a significant part of the academic community and is the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code is available at https://github.com/ZN1010/PEaCE.
comment: LREC-COLING 2024
☆ EDDA: A Encoder-Decoder Data Augmentation Framework for Zero-Shot Stance Detection
Stance detection aims to determine the attitude expressed in text towards a given target. Zero-shot stance detection (ZSSD) has emerged to classify stances towards unseen targets during inference. Recent data augmentation techniques for ZSSD increase transferable knowledge between targets through text or target augmentation. However, these methods exhibit limitations. Target augmentation lacks logical connections between generated targets and source text, while text augmentation relies solely on training data, resulting in insufficient generalization. To address these issues, we propose an encoder-decoder data augmentation (EDDA) framework. The encoder leverages large language models and chain-of-thought prompting to summarize texts into target-specific if-then rationales, establishing logical relationships. The decoder generates new samples based on these expressions using a semantic correlation word replacement strategy to increase syntactic diversity. We also analyze the generated expressions to develop a rationale-enhanced network that fully utilizes the augmented data. Experiments on benchmark datasets demonstrate our approach substantially improves over state-of-the-art ZSSD techniques. The proposed EDDA framework increases semantic relevance and syntactic variety in augmented texts while enabling interpretable rationale-based learning.
☆ FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models
Emotional Support Conversation (ESC) is a typical dialogue that can effec-tively assist the user in mitigating emotional pressures. However, owing to the inherent subjectivity involved in analyzing emotions, current non-artificial methodologies face challenges in effectively appraising the emo-tional support capability. These metrics exhibit a low correlation with human judgments. Concurrently, manual evaluation methods extremely will cause high costs. To solve these problems, we propose a novel model FEEL (Framework for Evaluating Emotional Support Capability with Large Lan-guage Models), employing Large Language Models (LLMs) as evaluators to assess emotional support capabilities. The model meticulously considers var-ious evaluative aspects of ESC to apply a more comprehensive and accurate evaluation method for ESC. Additionally, it employs a probability distribu-tion approach for a more stable result and integrates an ensemble learning strategy, leveraging multiple LLMs with assigned weights to enhance evalua-tion accuracy. To appraise the performance of FEEL, we conduct extensive experiments on existing ESC model dialogues. Experimental results demon-strate our model exhibits a substantial enhancement in alignment with human evaluations compared to the baselines. Our source code is available at https://github.com/Ansisy/FEEL.
comment: 14 pages,3 figures and 4 tables
☆ MixRED: A Mix-lingual Relation Extraction Dataset
Relation extraction is a critical task in the field of natural language processing with numerous real-world applications. Existing research primarily focuses on monolingual relation extraction or cross-lingual enhancement for relation extraction. Yet, there remains a significant gap in understanding relation extraction in the mix-lingual (or code-switching) scenario, where individuals intermix contents from different languages within sentences, generating mix-lingual content. Due to the lack of a dedicated dataset, the effectiveness of existing relation extraction models in such a scenario is largely unexplored. To address this issue, we introduce a novel task of considering relation extraction in the mix-lingual scenario called MixRE and constructing the human-annotated dataset MixRED to support this task. In addition to constructing the MixRED dataset, we evaluate both state-of-the-art supervised models and large language models (LLMs) on MixRED, revealing their respective advantages and limitations in the mix-lingual scenario. Furthermore, we delve into factors influencing model performance within the MixRE task and uncover promising directions for enhancing the performance of both supervised models and LLMs in this novel task.
☆ EAGLE: A Domain Generalization Framework for AI-generated Text Detection
With the advancement in capabilities of Large Language Models (LLMs), one major step in the responsible and safe use of such LLMs is to be able to detect text generated by these models. While supervised AI-generated text detectors perform well on text generated by older LLMs, with the frequent release of new LLMs, building supervised detectors for identifying text from such new models would require new labeled training data, which is infeasible in practice. In this work, we tackle this problem and propose a domain generalization framework for the detection of AI-generated text from unseen target generators. Our proposed framework, EAGLE, leverages the labeled data that is available so far from older language models and learns features invariant across these generators, in order to detect text generated by an unknown target generator. EAGLE learns such domain-invariant features by combining the representational power of self-supervised contrastive learning with domain adversarial training. Through our experiments we demonstrate how EAGLE effectively achieves impressive performance in detecting text generated by unseen target generators, including recent state-of-the-art ones such as GPT-4 and Claude, reaching detection scores of within 4.7% of a fully supervised detector.
☆ AC4: Algebraic Computation Checker for Circuit Constraints in ZKPs
ZKP systems have surged attention and held a fundamental role in contemporary cryptography. Zk-SNARK protocols dominate the ZKP usage, often implemented through arithmetic circuit programming paradigm. However, underconstrained or overconstrained circuits may lead to bugs. Underconstrained circuits refer to circuits that lack the necessary constraints, resulting in unexpected solutions in the circuit and causing the verifier to accept a bogus witness. Overconstrained circuits refer to circuits that are constrained excessively, resulting in the circuit lacking necessary solutions and causing the verifier to accept no witness, rendering the circuit meaningless. This paper introduces a novel approach for pinpointing two distinct types of bugs in ZKP circuits. The method involves encoding the arithmetic circuit constraints to polynomial equation systems and solving polynomial equation systems over a finite field by algebraic computation. The classification of verification results is refined, greatly enhancing the expressive power of the system. We proposed a tool, AC4, to represent the implementation of this method. Experiments demonstrate that AC4 represents a substantial 29% increase in the checked ratio compared to prior work. Within a solvable range, the checking time of AC4 has also exhibited noticeable improvement, demonstrating a magnitude increase compared to previous efforts.
comment: 20 pages, 4 figures
☆ AI for Biomedicine in the Era of Large Language Models
The capabilities of AI for biomedicine span a wide spectrum, from the atomic level, where it solves partial differential equations for quantum systems, to the molecular level, predicting chemical or protein structures, and further extending to societal predictions like infectious disease outbreaks. Recent advancements in large language models, exemplified by models like ChatGPT, have showcased significant prowess in natural language tasks, such as translating languages, constructing chatbots, and answering questions. When we consider biomedical data, we observe a resemblance to natural language in terms of sequences: biomedical literature and health records presented as text, biological sequences or sequencing data arranged in sequences, or sensor data like brain signals as time series. The question arises: Can we harness the potential of recent large language models to drive biomedical knowledge discoveries? In this survey, we will explore the application of large language models to three crucial categories of biomedical data: 1) textual data, 2) biological sequences, and 3) brain signals. Furthermore, we will delve into large language model challenges in biomedical research, including ensuring trustworthiness, achieving personalization, and adapting to multi-modal data representation
comment: 8 pages, 3 figures
♻ ☆ BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability
While there are numerous benchmarks comparing the performance of modern language models (LMs), end-task evaluations often conflate notions of *factual accuracy* ("truth") and *reasoning ability* ("rationality", or "honesty" in the sense of correctly reporting implications of beliefs). Our goal is a dataset that clearly distinguishes these two notions. Our approach is to leverage and extend a collection of human-annotated *entailment trees*, engineered to express both good and bad chains of reasoning, and using a mixture of true and false facts, in particular including counterfactual examples, to avoid belief bias (also known as the "content effect"). The resulting dataset, called BaRDa, contains 3000 entailments (1787 valid, 1213 invalid), using 6681 true and 2319 false statements. Testing on four GPT-series models, GPT3(curie)/GPT3(davinici)/3.5/4, we find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2. This shows the clear progression of models towards improved factual accuracy and entailment reasoning, and the dataset provides a new benchmark that more cleanly separates and quantifies these two notions.
comment: Added note about how dataset sampling was performed
♻ ☆ MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning; this makes it simultaneously much more challenging than other synthetically-crafted benchmarks while remaining realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning.
♻ ☆ PILOT: Legal Case Outcome Prediction with Case Law
Machine learning shows promise in predicting the outcome of legal cases, but most research has concentrated on civil law cases rather than case law systems. We identified two unique challenges in making legal case outcome predictions with case law. First, it is crucial to identify relevant precedent cases that serve as fundamental evidence for judges during decision-making. Second, it is necessary to consider the evolution of legal principles over time, as early cases may adhere to different legal contexts. In this paper, we proposed a new framework named PILOT (PredictIng Legal case OuTcome) for case outcome prediction. It comprises two modules for relevant case retrieval and temporal pattern handling, respectively. To benchmark the performance of existing legal case outcome prediction models, we curated a dataset from a large-scale case law database. We demonstrate the importance of accurately identifying precedent cases and mitigating the temporal shift when making predictions for case law, as our method shows a significant improvement over the prior methods that focus on civil law case outcome predictions.
♻ ☆ Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers
Identifying linguistic differences between dialects of a language often requires expert knowledge and meticulous human analysis. This is largely due to the complexity and nuance involved in studying various dialects. We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers, even in the absence of human experts. We explore both post-hoc and intrinsic approaches to interpretability, conduct experiments on Mandarin, Italian, and Low Saxon, and experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations.
comment: Code is available at https://github.com/ruoyuxie/interpretable_dialect_classifier
♻ ☆ Dissociating language and thought in large language models
Large Language Models (LLMs) have come closest among all models to date to mastering human language, yet opinions about their linguistic and cognitive capabilities remain split. Here, we evaluate LLMs using a distinction between formal linguistic competence - knowledge of linguistic rules and patterns - and functional linguistic competence - understanding and using language in the world. We ground this distinction in human neuroscience, which has shown that formal and functional competence rely on different neural mechanisms. Although LLMs are surprisingly good at formal competence, their performance on functional competence tasks remains spotty and often requires specialized fine-tuning and/or coupling with external modules. We posit that models that use language in human-like ways would need to master both of these competence types, which, in turn, could require the emergence of mechanisms specialized for formal linguistic competence, distinct from functional competence.
comment: The two lead authors contributed equally to this work; published in "Trends in Cognnitive Sciences", March 2024
♻ ☆ Knowledge Graph Large Language Model (KG-LLM) for Link Prediction
The task of predicting multiple links within knowledge graphs (KGs) stands as a challenge in the field of knowledge graph analysis, a challenge increasingly resolvable due to advancements in natural language processing (NLP) and KG embedding techniques. This paper introduces a novel methodology, the Knowledge Graph Large Language Model Framework (KG-LLM), which leverages pivotal NLP paradigms, including chain-of-thought (CoT) prompting and in-context learning (ICL), to enhance multi-hop link prediction in KGs. By converting the KG to a CoT prompt, our framework is designed to discern and learn the latent representations of entities and their interrelations. To show the efficacy of the KG-LLM Framework, we fine-tune three leading Large Language Models (LLMs) within this framework, employing both non-ICL and ICL tasks for a comprehensive evaluation. Further, we explore the framework's potential to provide LLMs with zero-shot capabilities for handling previously unseen prompts. Our experimental findings discover that integrating ICL and CoT not only augments the performance of our approach but also significantly boosts the models' generalization capacity, thereby ensuring more precise predictions in unfamiliar scenarios.
comment: 23 pages, 2 figures
♻ ☆ Large Language Models for Generative Recommendation: A Survey and Visionary Discussions LREC
Large language models (LLM) not only have revolutionized the field of natural language processing (NLP) but also have the potential to reshape many other fields, e.g., recommender systems (RS). However, most of the related work treats an LLM as a component of the conventional recommendation pipeline (e.g., as a feature extractor), which may not be able to fully leverage the generative power of LLM. Instead of separating the recommendation process into multiple stages, such as score computation and re-ranking, this process can be simplified to one stage with LLM: directly generating recommendations from the complete pool of items. This survey reviews the progress, methods, and future directions of LLM-based generative recommendation by examining three questions: 1) What generative recommendation is, 2) Why RS should advance to generative recommendation, and 3) How to implement LLM-based generative recommendation for various RS tasks. We hope that this survey can provide the context and guidance needed to explore this interesting and emerging topic.
comment: Published as a conference paper at LREC-COLING 2024
♻ ☆ SEED: Customize Large Language Models with Sample-Efficient Adaptation for Code Generation
Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training samples available in practice lead to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with few training samples is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named SEED, which stands for Sample-Efficient adaptation with Error-Driven learning for code generation. SEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome its own shortcomings, thus achieving efficient learning. Specifically, SEED involves identifying error code generated by LLMs, employing Self-revise for code revision, optimizing the model with revised code, and iteratively adapting the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, SEED achieves superior performance with few training samples, showing an average relative improvement of 54.7% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-revise, which generates revised code that optimizes the model more efficiently compared to the code samples from datasets. Moreover, SEED consistently demonstrates strong performance across various LLMs, underscoring its generalizability.
♻ ☆ In-context Learning with Retrieved Demonstrations for Language Models: A Survey
Language models, especially pre-trained large language models, have showcased remarkable abilities as few-shot in-context learners (ICL), adept at adapting to new tasks with just a few demonstrations in the input context. However, the model's ability to perform ICL is sensitive to the choice of the few-shot demonstrations. Instead of using a fixed set of demonstrations, one recent development is to retrieve demonstrations tailored to each input query. The implementation of demonstration retrieval is relatively straightforward, leveraging existing databases and retrieval systems. This not only improves the efficiency and scalability of the learning process but also has been shown to reduce biases inherent in manual example selection. In light of the encouraging results and growing research in ICL with retrieved demonstrations, we conduct an extensive review of studies in this area. In this survey, we discuss and compare different design choices for retrieval models, retrieval training procedures, and inference algorithms.
♻ ☆ K-Act2Emo: Korean Commonsense Knowledge Graph for Indirect Emotional Expression
In many literary texts, emotions are indirectly conveyed through descriptions of actions, facial expressions, and appearances, necessitating emotion inference for narrative understanding. In this paper, we introduce K-Act2Emo, a Korean commonsense knowledge graph (CSKG) comprising 1,900 indirect emotional expressions and the emotions inferable from them. We categorize reasoning types into inferences in positive situations, inferences in negative situations, and inferences when expressions do not serve as emotional cues. Unlike existing CSKGs, K-Act2Emo specializes in emotional contexts, and experimental results validate its effectiveness for training emotion inference models. Significantly, the BART-based knowledge model fine-tuned with K-Act2Emo outperforms various existing Korean large language models, achieving performance levels comparable to GPT-4 Turbo.
comment: 10 pages
♻ ☆ Large Language Models for Mathematical Reasoning: Progresses and Challenges EACL 2024
Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.
comment: EACL 2024 Student Research Workshop, 8 pages
♻ ☆ Extracting Emotion Phrases from Tweets using BART
Sentiment analysis is a natural language processing task that aims to identify and extract the emotional aspects of a text. However, many existing sentiment analysis methods primarily classify the overall polarity of a text, overlooking the specific phrases that convey sentiment. In this paper, we applied an approach to sentiment analysis based on a question-answering framework. Our approach leverages the power of Bidirectional Autoregressive Transformer (BART), a pre-trained sequence-to-sequence model, to extract a phrase from a given text that amplifies a given sentiment polarity. We create a natural language question that identifies the specific emotion to extract and then guide BART to pay attention to the relevant emotional cues in the text. We use a classifier within BART to predict the start and end positions of the answer span within the text, which helps to identify the precise boundaries of the extracted emotion phrase. Our approach offers several advantages over most sentiment analysis studies, including capturing the complete context and meaning of the text and extracting precise token spans that highlight the intended sentiment. We achieved an end loss of 87% and Jaccard score of 0.61.
♻ ☆ Causal Intersectionality and Dual Form of Gradient Descent for Multimodal Analysis: a Case Study on Hateful Memes LREC
Amidst the rapid expansion of Machine Learning (ML) and Large Language Models (LLMs), understanding the semantics within their mechanisms is vital. Causal analyses define semantics, while gradient-based methods are essential to eXplainable AI (XAI), interpreting the model's 'black box'. Integrating these, we investigate how a model's mechanisms reveal its causal effect on evidence-based decision-making. Research indicates intersectionality - the combined impact of an individual's demographics - can be framed as an Average Treatment Effect (ATE). This paper demonstrates that hateful meme detection can be viewed as an ATE estimation using intersectionality principles, and summarized gradient-based attention scores highlight distinct behaviors of three Transformer models. We further reveal that LLM Llama-2 can discern the intersectional aspects of the detection through in-context learning and that the learning process could be explained via meta-gradient, a secondary form of gradient. In conclusion, this work furthers the dialogue on Causality and XAI. Our code is available online (see External Resources section).
comment: Accepted to LREC-COLING 2024
♻ ☆ A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning NAACL 2024
Logical reasoning has been an ongoing pursuit in the field of AI. Despite significant advancements made by large language models (LLMs), they still struggle with complex logical reasoning problems. To enhance reasoning performance, one promising direction is scalable oversight, which requires LLMs to identify their own errors and then improve by themselves. Various self-verification methods have been proposed in pursuit of this goal. Nevertheless, whether existing models understand their own errors well is still under investigation. In this paper, we take a closer look at the self-verification abilities of LLMs in the context of logical reasoning, focusing on their ability to identify logical fallacies accurately. We introduce a dataset, FALLACIES, containing 232 types of reasoning fallacies categorized in a hierarchical taxonomy. By conducting exhaustive experiments on FALLACIES, we obtain comprehensive and detailed analyses of a series of models on their verification abilities. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods. Drawing from these observations, we offer suggestions for future research and practical applications of self-verification methods.
comment: NAACL 2024 Main Conference
♻ ☆ K-pop Lyric Translation: Dataset, Analysis, and Neural-Modelling LREC
Lyric translation, a field studied for over a century, is now attracting computational linguistics researchers. We identified two limitations in previous studies. Firstly, lyric translation studies have predominantly focused on Western genres and languages, with no previous study centering on K-pop despite its popularity. Second, the field of lyric translation suffers from a lack of publicly available datasets; to the best of our knowledge, no such dataset exists. To broaden the scope of genres and languages in lyric translation studies, we introduce a novel singable lyric translation dataset, approximately 89\% of which consists of K-pop song lyrics. This dataset aligns Korean and English lyrics line-by-line and section-by-section. We leveraged this dataset to unveil unique characteristics of K-pop lyric translation, distinguishing it from other extensively studied genres, and to construct a neural lyric translation model, thereby underscoring the importance of a dedicated dataset for singable lyric translations.
comment: LREC-COLING 2024
♻ ☆ KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models
Large language models (LLMs) demonstrate remarkable performance on knowledge-intensive tasks, suggesting that real-world knowledge is encoded in their model parameters. However, besides explorations on a few probing tasks in limited knowledge domains, it is not well understood how to evaluate LLMs' knowledge systematically and how well their knowledge abilities generalize, across a spectrum of knowledge domains and progressively complex task formats. To this end, we propose KGQuiz, a knowledge-intensive benchmark to comprehensively investigate the knowledge generalization abilities of LLMs. KGQuiz is a scalable framework constructed from triplet-based knowledge, which covers three knowledge domains and consists of five tasks with increasing complexity: true-or-false, multiple-choice QA, blank filling, factual editing, and open-ended knowledge generation. To gain a better understanding of LLMs' knowledge abilities and their generalization, we evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains. Extensive experiments demonstrate that LLMs achieve impressive performance in straightforward knowledge QA tasks, while settings and contexts requiring more complex reasoning or employing domain-specific facts still present significant challenges. We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats, and ultimately to understand, evaluate, and improve LLMs' knowledge abilities across a wide spectrum of knowledge domains and tasks.
comment: TheWebConf 2024
♻ ☆ LLM Paternity Test: Generated Text Detection with LLM Genetic Inheritance
Large language models (LLMs) can generate texts that carry the risk of various misuses, including plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets. Detecting whether a text is machine-generated has thus become increasingly important. While existing detection methods exhibit superior performance, they often lack generalizability due to their heavy dependence on training data. To alleviate this problem, we propose a model-related generated text detection method, the LLM Paternity Test (LLM-Pat). Specifically, given any candidate text (\textit{child}), LLM-Pat employs an intermediary LLM (\textit{parent}) to reconstruct a \textit{sibling} text corresponding to the given text and then measures the similarity between candidate texts and their sibling texts. High similarity indicates that the candidate text is machine-generated, akin to genetic traits. We have constructed datasets encompassing four scenarios: student responses in educational settings, news creation, academic paper writing, and social media bots to assess the performance of LLM-Pat. The experiments show that LLM-Pat outperforms the existing detection methods and is more robust against paraphrasing attacks and re-translating attacks. Besides, LLM-Pat can also be used to trace which large language model the text was generated by. The constructed dataset and code will be released to benefit the community.
♻ ☆ Tandem Transformers for Inference Efficient LLMs
The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face limitations: either relying on less accurate smaller models for generation or failing to fully leverage the base LLM's representations. We introduce a novel architecture, Tandem transformers, to address these issues. This architecture uniquely combines (1) a small autoregressive model and (2) a large model operating in block mode (processing multiple tokens simultaneously). The small model's predictive accuracy is substantially enhanced by granting it attention to the large model's richer representations. On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. We further incorporate the tandem model within the speculative decoding (SPEED) framework where the large model validates tokens from the small model. This ensures that the Tandem of PaLM2-Bison and PaLM2-Gecko achieves substantial speedup (around 1.14x faster than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream task accuracy.
♻ ☆ Multilingual Coreference Resolution in Low-resource South Asian Languages LREC
Coreference resolution involves the task of identifying text spans within a discourse that pertain to the same real-world entity. While this task has been extensively explored in the English language, there has been a notable scarcity of publicly accessible resources and models for coreference resolution in South Asian languages. We introduce a Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages using off-the-shelf tools for translation and word-alignment. Nearly all of the predicted translations successfully pass a sanity check, and 75% of English references align with their predicted translations. Using multilingual encoders, two off-the-shelf coreference resolution models were trained on a concatenation of TransMuCoRes and a Hindi coreference resolution dataset with manual annotations. The best performing model achieved a score of 64 and 68 for LEA F1 and CoNLL F1, respectively, on our test-split of Hindi golden set. This study is the first to evaluate an end-to-end coreference resolution model on a Hindi golden set. Furthermore, this work underscores the limitations of current coreference evaluation metrics when applied to datasets with split antecedents, advocating for the development of more suitable evaluation metrics.
comment: Accepted at LREC-COLING 2024
♻ ☆ Improving Low-Resource Knowledge Tracing Tasks by Supervised Pre-training and Importance Mechanism Fine-tuning
Knowledge tracing (KT) aims to estimate student's knowledge mastery based on their historical interactions. Recently, the deep learning based KT (DLKT) approaches have achieved impressive performance in the KT task. These DLKT models heavily rely on the large number of available student interactions. However, due to various reasons such as budget constraints and privacy concerns, observed interactions are very limited in many real-world scenarios, a.k.a, low-resource KT datasets. Directly training a DLKT model on a low-resource KT dataset may lead to overfitting and it is difficult to choose the appropriate deep neural architecture. Therefore, in this paper, we propose a low-resource KT framework called LoReKT to address above challenges. Inspired by the prevalent "pre-training and fine-tuning" paradigm, we aim to learn transferable parameters and representations from rich-resource KT datasets during the pre-training stage and subsequently facilitate effective adaptation to low-resource KT datasets. Specifically, we simplify existing sophisticated DLKT model architectures with purely a stack of transformer decoders. We design an encoding mechanism to incorporate student interactions from multiple KT data sources and develop an importance mechanism to prioritize updating parameters with high importance while constraining less important ones during the fine-tuning stage. We evaluate LoReKT on six public KT datasets and experimental results demonstrate the superiority of our approach in terms of AUC and Accuracy. To encourage reproducible research, we make our data and code publicly available at https://anonymous.4open.science/r/LoReKT-C619.
comment: 29 pages, 4 figures
♻ ☆ Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models NAACL 2024
Large Language Models (LLMs) exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One of the current alignment techniques includes principle-driven integration, but it faces challenges arising from the imprecision of manually crafted rules and inadequate risk perception in models without safety training. To address these, we introduce Guide-Align, a two-stage approach. Initially, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, establishing a comprehensive library of guidelines and a model for input-guidelines retrieval. Subsequently, the retrieval model correlates new inputs with relevant guidelines, which guide LLMs in response generation to ensure safe and high-quality outputs, thereby aligning with human values. An additional optional stage involves fine-tuning a model with well-aligned datasets generated through the process implemented in the second stage. Our method customizes guidelines to accommodate diverse inputs, thereby enhancing the fine-grainedness and comprehensiveness of the guideline library. Furthermore, it incorporates safety expertise from a safety-trained LLM through a lightweight retrieval model. We evaluate our approach on three benchmarks, demonstrating significant improvements in LLM security and quality. Notably, our fine-tuned model, Labrador, even at 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities.
comment: Accepted to NAACL 2024 main conference
♻ ☆ Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
Large Language Models (LLMs) have become a cornerstone in the field of Natural Language Processing (NLP), offering transformative capabilities in understanding and generating human-like text. However, with their rising prominence, the security and vulnerability aspects of these models have garnered significant attention. This paper presents a comprehensive survey of the various forms of attacks targeting LLMs, discussing the nature and mechanisms of these attacks, their potential impacts, and current defense strategies. We delve into topics such as adversarial attacks that aim to manipulate model outputs, data poisoning that affects model training, and privacy concerns related to training data exploitation. The paper also explores the effectiveness of different attack methodologies, the resilience of LLMs against these attacks, and the implications for model integrity and user trust. By examining the latest research, we provide insights into the current landscape of LLM vulnerabilities and defense mechanisms. Our objective is to offer a nuanced understanding of LLM attacks, foster awareness within the AI community, and inspire robust solutions to mitigate these risks in future developments.
♻ ☆ Chain-of-Interaction: Enhancing Large Language Models for Psychiatric Behavior Understanding by Dyadic Contexts CHI 2024
Automatic coding patient behaviors is essential to support decision making for psychotherapists during the motivational interviewing (MI), a collaborative communication intervention approach to address psychiatric issues, such as alcohol and drug addiction. While the behavior coding task has rapidly adapted machine learning to predict patient states during the MI sessions, lacking of domain-specific knowledge and overlooking patient-therapist interactions are major challenges in developing and deploying those models in real practice. To encounter those challenges, we introduce the Chain-of-Interaction (CoI) prompting method aiming to contextualize large language models (LLMs) for psychiatric decision support by the dyadic interactions. The CoI prompting approach systematically breaks down the coding task into three key reasoning steps, extract patient engagement, learn therapist question strategies, and integrates dyadic interactions between patients and therapists. This approach enables large language models to leverage the coding scheme, patient state, and domain knowledge for patient behavioral coding. Experiments on real-world datasets can prove the effectiveness and flexibility of our prompting method with multiple state-of-the-art LLMs over existing prompting baselines. We have conducted extensive ablation analysis and demonstrate the critical role of dyadic interactions in applying LLMs for psychotherapy behavior understanding.
comment: Accepted to IEEE ICHI 2024
♻ ☆ ContrastWSD: Enhancing Metaphor Detection with Word Sense Disambiguation Following the Metaphor Identification Procedure LREC
This paper presents ContrastWSD, a RoBERTa-based metaphor detection model that integrates the Metaphor Identification Procedure (MIP) and Word Sense Disambiguation (WSD) to extract and contrast the contextual meaning with the basic meaning of a word to determine whether it is used metaphorically in a sentence. By utilizing the word senses derived from a WSD model, our model enhances the metaphor detection process and outperforms other methods that rely solely on contextual embeddings or integrate only the basic definitions and other external knowledge. We evaluate our approach on various benchmark datasets and compare it with strong baselines, indicating the effectiveness in advancing metaphor detection.
comment: 9 pages, 2 figures, accepted for the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
♻ ☆ Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models
With the widespread use of language models (LMs) in NLP tasks, researchers have discovered the potential of Chain-of-thought (CoT) to assist LMs in accomplishing complex reasoning tasks by generating intermediate steps. However, human thought processes are often non-linear, rather than simply sequential chains of thoughts. Therefore, we propose Graph-of-Thought (GoT) reasoning, which models human thought processes not only as a chain but also as a graph. By representing thought units as nodes and connections between them as edges, our approach captures the non-sequential nature of human thinking and allows for a more realistic modeling of thought processes. GoT adopts a two-stage framework with an additional GoT encoder for thought graph representation and fuses the graph representation with the original input representation through a gated fusion mechanism. We evaluate GoT's performance on a text-only reasoning task (AQUA-RAT) and a multimodal reasoning task (ScienceQA). Our model achieves significant improvement over the strong CoT baseline on the AQUA-RAT test set and boosts accuracy from 85.19% to 87.59% using the T5-base model over the state-of-the-art Multimodal-CoT on the ScienceQA test set.
♻ ☆ Benchmarking LLM-based Machine Translation on Cultural Awareness
Translating cultural-specific content is crucial for effective cross-cultural communication. However, many MT systems still struggle to translate sentences containing cultural-specific entities accurately and understandably. Recent advancements in in-context learning utilize lightweight prompts to guide large language models (LLMs) in machine translation tasks. Nevertheless, the effectiveness of this approach in enhancing machine translation with cultural awareness remains uncertain. To address this gap, we introduce a new data curation pipeline to construct a culturally relevant parallel corpus, enriched with annotations of cultural-specific items. Furthermore, we devise a novel evaluation metric to assess the understandability of translations in a reference-free manner by GPT-4. We evaluate a variety of neural machine translation (NMT) and LLM-based MT systems using our dataset. Additionally, we propose several prompting strategies for LLMs to incorporate external and internal cultural knowledge into the translation process. Our results demonstrate that eliciting explanations can significantly enhance the understandability of cultural-specific entities, especially those without well-known translations.
♻ ☆ Multi-party Response Generation with Relation Disentanglement
Existing neural response generation models have achieved impressive improvements for two-party conversations, which assume that utterances are sequentially organized. However, many real-world dialogues involve multiple interlocutors and the structure of conversational context is much more complex, e.g. utterances from different interlocutors can occur "in parallel". Facing this challenge, there are works trying to model the relations among utterances or interlocutors to facilitate response generation with clearer context. Nonetheless, these methods rely heavily on such relations and all assume that these are given beforehand, which is impractical and hinders the generality of such methods. In this work, we propose to automatically infer the relations via relational thinking on subtle clues inside the conversation context without any human label, and leverage these relations to guide the neural response generation. Specifically, we first apply a deep graph random process to fully consider all possible relations among utterances in the conversational context. Then the inferred relation graphs are integrated with a variational auto-encoder framework to train a GAN for structure-aware response generation. Experimental results on the Ubuntu Internet Relay Chat (IRC) channel benchmark and the most recent Movie Dialogues show that our method outperforms various baseline models for multi-party response generation.
comment: The paper needs systematic polishment to consider recent development in dialogue
Software Engineering
☆ Who Uses Personas in Requirements Engineering: The Practitioners' Perspective
Personas are commonly used in software projects to gain a better understanding of end-users' needs. However, there is a limited understanding of their usage and effectiveness in practice. This paper presents the results of a two-step investigation, comprising interviews with 26 software developers, UI/UX designers, business analysts and product managers and a survey of 203 practitioners, aimed at shedding light on the current practices, methods and challenges of using personas in software development. Our findings reveal variations in the frequency and effectiveness of personas across different software projects and IT companies, the challenges practitioners face when using personas and the reasons for not using them at all. Furthermore, we investigate the coverage of human aspects in personas, often assumed to be a key feature of persona descriptions. Contrary to the general perception, our study shows that human aspects are often ignored for various reasons in personas or requirements engineering in general. Our study provides actionable insights for practitioners to overcome challenges in using personas during requirements engineering stages, and we identify areas for future research.
☆ Automated System-level Testing of Unmanned Aerial Systems
Unmanned aerial systems (UAS) rely on various avionics systems that are safety-critical and mission-critical. A major requirement of international safety standards is to perform rigorous system-level testing of avionics software systems. The current industrial practice is to manually create test scenarios, manually/automatically execute these scenarios using simulators, and manually evaluate outcomes. The test scenarios typically consist of setting certain flight or environment conditions and testing the system under test in these settings. The state-of-the-art approaches for this purpose also require manual test scenario development and evaluation. In this paper, we propose a novel approach to automate the system-level testing of the UAS. The proposed approach (AITester) utilizes model-based testing and artificial intelligence (AI) techniques to automatically generate, execute, and evaluate various test scenarios. The test scenarios are generated on the fly, i.e., during test execution based on the environmental context at runtime. The approach is supported by a toolset. We empirically evaluate the proposed approach on two core components of UAS, an autopilot system of an unmanned aerial vehicle (UAV) and cockpit display systems (CDS) of the ground control station (GCS). The results show that the AITester effectively generates test scenarios causing deviations from the expected behavior of the UAV autopilot and reveals potential flaws in the GCS-CDS.
☆ When LLM-based Code Generation Meets the Software Development Process
Software process models play a pivotal role in fostering collaboration and communication within software teams, enabling them to tackle intricate development tasks effectively. This paper introduces LCG, a code generation framework inspired by established software engineering practices. LCG leverages multiple Large Language Model (LLM) agents to emulate various software process models, namely LCGWaterfall, LCGTDD, and LCGScrum. Each model assigns LLM agents specific roles such as requirement engineer, architect, developer, tester, and scrum master, mirroring typical development activities and communication patterns. Through collaborative efforts utilizing chain-of-thought and prompt composition techniques, the agents continuously refine themselves to enhance code quality. Utilizing GPT3.5 as the underlying LLM and baseline (GPT), we evaluate LCG across four code generation benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Results indicate LCGScrum outperforms other models, achieving Pass@1 scores of 75.2, 65.5, 82.5, and 56.7 in HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively - an average 15% improvement over GPT. Analysis reveals distinct impacts of development activities on generated code, with design and code reviews contributing to enhanced exception handling, while design, testing, and code reviews mitigate code smells. Furthermore, temperature values exhibit negligible influence on Pass@1 across all models. However, variations in Pass@1 are notable for different GPT3.5 model versions, ranging from 5 to over 60 in HumanEval, highlighting the stability of LCG across model versions. This stability underscores the importance of adopting software process models to bolster the quality and consistency of LLM-generated code.
☆ Local Features: Enhancing Variability Modeling in Software Product Lines
Context and motivation: Software Product Lines (SPL) enable the creation of software product families with shared core components using feature models to model variability. Choosing features from a feature model to generate a product may not be sufficient in certain situations because the application engineer may need to be able to decide on configuration time the system's elements to which a certain feature will be applied. Therefore, there is a need to select which features have to be included in the product but also to which of its elements they have to be applied. Objective: We introduce local features that are selectively applied to specific parts of the system during product configuration. Results: We formalize local features using multimodels to establish relationships between local features and other elements of the system models. The paper includes examples illustrating the motivation for local features, a formal definition, and a domain-specific language for specification and implementation. Finally, we present a case study in a real scenario that shows how the concept of local features allowed us to define the variability of a complex system. The examples and the application case show that the proposal achieves higher customization levels at the application engineering phase.
☆ Leveraging Large Language Models for Preliminary Security Risk Analysis: A Mission-Critical Case Study
Preliminary security risk analysis (PSRA) provides a quick approach to identify, evaluate and propose remeditation to potential risks in specific scenarios. The extensive expertise required for an effective PSRA and the substantial ammount of textual-related tasks hinder quick assessments in mission-critical contexts, where timely and prompt actions are essential. The speed and accuracy of human experts in PSRA significantly impact response time. A large language model can quickly summarise information in less time than a human. To our knowledge, no prior study has explored the capabilities of fine-tuned models (FTM) in PSRA. Our case study investigates the proficiency of FTM to assist practitioners in PSRA. We manually curated 141 representative samples from over 50 mission-critical analyses archived by the industrial context team in the last five years.We compared the proficiency of the FTM versus seven human experts. Within the industrial context, our approach has proven successful in reducing errors in PSRA, hastening security risk detection, and minimizing false positives and negatives. This translates to cost savings for the company by averting unnecessary expenses associated with implementing unwarranted countermeasures. Therefore, experts can focus on more comprehensive risk analysis, leveraging LLMs for an effective preliminary assessment within a condensed timeframe.
☆ CodeShell Technical Report
Code large language models mark a pivotal breakthrough in artificial intelligence. They are specifically crafted to understand and generate programming languages, significantly boosting the efficiency of coding development workflows. In this technical report, we present CodeShell-Base, a seven billion-parameter foundation model with 8K context length, showcasing exceptional proficiency in code comprehension. By incorporating Grouped-Query Attention and Rotary Positional Embedding into GPT-2, CodeShell-Base integrates the structural merits of StarCoder and CodeLlama and forms its unique architectural design. We then carefully built a comprehensive data pre-processing process, including similar data deduplication, perplexity-based data filtering, and model-based data filtering. Through this process, We have curated 100 billion high-quality pre-training data from GitHub. Benefiting from the high-quality data, CodeShell-Base outperforms CodeLlama in Humaneval after training on just 500 billion tokens (5 epochs). We have conducted extensive experiments across multiple language datasets, including Python, Java, and C++, and the results indicate that our model possesses robust foundational capabilities in code comprehension and generation.
☆ A hybrid LLM workflow can help identify user privilege related variables in programs of any size
Many programs involves operations and logic manipulating user privileges, which is essential for the security of an organization. Therefore, one common malicious goal of attackers is to obtain or escalate the privileges, causing privilege leakage. To protect the program and the organization against privilege leakage attacks, it is important to eliminate the vulnerabilities which can be exploited to achieve such attacks. Unfortunately, while memory vulnerabilities are less challenging to find, logic vulnerabilities are much more imminent, harmful and difficult to identify. Accordingly, many analysts choose to find user privilege related (UPR) variables first as start points to investigate the code where the UPR variables may be used to see if there exists any vulnerabilities, especially the logic ones. In this paper, we introduce a large language model (LLM) workflow that can assist analysts in identifying such UPR variables, which is considered to be a very time-consuming task. Specifically, our tool will audit all the variables in a program and output a UPR score, which is the degree of relationship (closeness) between the variable and user privileges, for each variable. The proposed approach avoids the drawbacks introduced by directly prompting a LLM to find UPR variables by focusing on leverage the LLM at statement level instead of supplying LLM with very long code snippets. Those variables with high UPR scores are essentially potential UPR variables, which should be manually investigated. Our experiments show that using a typical UPR score threshold (i.e., UPR score >0.8), the false positive rate (FPR) is only 13.49%, while UPR variable found is significantly more than that of the heuristic based method.
☆ AC4: Algebraic Computation Checker for Circuit Constraints in ZKPs
ZKP systems have surged attention and held a fundamental role in contemporary cryptography. Zk-SNARK protocols dominate the ZKP usage, often implemented through arithmetic circuit programming paradigm. However, underconstrained or overconstrained circuits may lead to bugs. Underconstrained circuits refer to circuits that lack the necessary constraints, resulting in unexpected solutions in the circuit and causing the verifier to accept a bogus witness. Overconstrained circuits refer to circuits that are constrained excessively, resulting in the circuit lacking necessary solutions and causing the verifier to accept no witness, rendering the circuit meaningless. This paper introduces a novel approach for pinpointing two distinct types of bugs in ZKP circuits. The method involves encoding the arithmetic circuit constraints to polynomial equation systems and solving polynomial equation systems over a finite field by algebraic computation. The classification of verification results is refined, greatly enhancing the expressive power of the system. We proposed a tool, AC4, to represent the implementation of this method. Experiments demonstrate that AC4 represents a substantial 29% increase in the checked ratio compared to prior work. Within a solvable range, the checking time of AC4 has also exhibited noticeable improvement, demonstrating a magnitude increase compared to previous efforts.
comment: 20 pages, 4 figures
♻ ☆ SEED: Customize Large Language Models with Sample-Efficient Adaptation for Code Generation
Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training samples available in practice lead to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with few training samples is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named SEED, which stands for Sample-Efficient adaptation with Error-Driven learning for code generation. SEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome its own shortcomings, thus achieving efficient learning. Specifically, SEED involves identifying error code generated by LLMs, employing Self-revise for code revision, optimizing the model with revised code, and iteratively adapting the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, SEED achieves superior performance with few training samples, showing an average relative improvement of 54.7% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-revise, which generates revised code that optimizes the model more efficiently compared to the code samples from datasets. Moreover, SEED consistently demonstrates strong performance across various LLMs, underscoring its generalizability.
♻ ☆ On The Effectiveness of One-Class Support Vector Machine in Different Defect Prediction Scenarios
Defect prediction aims at identifying software components that are likely to cause faults before a software is made available to the end-user. To date, this task has been modeled as a two-class classification problem, however its nature also allows it to be formulated as a one-class classification task. Previous studies show that One-Class Support Vector Machine (OCSVM) can outperform two-class classifiers for within-project defect prediction, however it is not effective when employed at a finer granularity (i.e., commit-level defect prediction). In this paper, we further investigate whether learning from one class only is sufficient to produce effective defect prediction model in two other different scenarios (i.e., granularity), namely cross-version and cross-project defect prediction models, as well as replicate the previous work at within-project granularity for completeness. Our empirical results confirm that OCSVM performance remain low at different granularity levels, that is, it is outperformed by the two-class Random Forest (RF) classifier for both cross-version and cross-project defect prediction. While, we cannot conclude that OCSVM is the best classifier, our results still show interesting findings. While OCSVM does not outperform RF, it still achieves performance superior to its two-class counterpart (i.e., SVM) as well as other two-class classifiers studied herein. We also observe that OCSVM is more suitable for both cross-version and cross-project defect prediction, rather than for within-project defect prediction, thus suggesting it performs better with heterogeneous data. We encourage further research on one-class classifiers for defect prediction as these techniques may serve as an alternative when data about defective modules is scarce or not available.
comment: Published at SANER'24 (Winner of the Best RENE paper award) see https://conf.researchr.org/details/saner-2024/saner-2024-reproducibility-studies-and-negative-results--rene--track-/78/On-The-Effectiveness-of-One-Class-Support-Vector-Machine-in-Different-Defect-Predicti
♻ ☆ A Large-Scale Evaluation for Log Parsing Techniques: How Far Are We? ISSTA 2024
Log data have facilitated various tasks of software development and maintenance, such as testing, debugging and diagnosing. Due to the unstructured nature of logs, log parsing is typically required to transform log messages into structured data for automated log analysis. Given the abundance of log parsers that employ various techniques, evaluating these tools to comprehend their characteristics and performance becomes imperative. Loghub serves as a commonly used dataset for benchmarking log parsers, but it suffers from limited scale and representativeness, posing significant challenges for studies to comprehensively evaluate existing log parsers or develop new methods. This limitation is particularly pronounced when assessing these log parsers for production use. To address these limitations, we provide a new collection of annotated log datasets, denoted Loghub-2.0, which can better reflect the characteristics of log data in real-world software systems. Loghub-2.0 comprises 14 datasets with an average of 3.6 million log lines in each dataset. Based on Loghub-2.0, we conduct a thorough re-evaluation of 15 state-of-the-art log parsers in a more rigorous and practical setting. Particularly, we introduce a new evaluation metric to mitigate the sensitivity of existing metrics to imbalanced data distributions. We are also the first to investigate the granular performance of log parsers on logs that represent rare system events, offering in-depth details for software diagnosis. Accurately parsing such logs is essential, yet it remains a challenge. We believe this work could shed light on the evaluation and design of log parsers in practical settings, thereby facilitating their deployment in production systems.
comment: This paper was accepted by 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024)
Formal Languages and Automata Theory
☆ A short survey around the pumping lemma for context-free languages
Following a seminar the present author gave to an Automata Theory course to computer science students, it will be presented, in a very synthetic and mostly selfcontained way, the principal properties of context free languages (CFL), with particular attention given to the Pumping Lemma (PL), and of grammars which generate them(CFG). We refer to Chomsky and Schutzenberger for the first works about it. What is known in literature as the Iteration Theorem here will be referred to as the Ogden's Lemma in a fully justified way. All definitions not strictly connected with the notion of context freeness will be omitted (we will give precise references for all of them). The symbology used is substantially the classical one, but we will replace some symbols to avoid confusion with those used in logic
♻ ☆ Automata and coalgebras in categories of species
We study generalized automata (in the sense of Ad\'amek-Trnkov\'a) in Joyal's category of (set-valued) combinatorial species, and as an important preliminary step, we study coalgebras for its derivative endofunctor $\partial$ and for the `Euler homogeneity operator' $L\circ\partial$ arising from the adjunction $L\dashv\partial\dashv R$.
comment: Hom. Il. 18.371-376
Information Retrieval
☆ Model, Analyze, and Comprehend User Interactions and Various Attributes within a Social Media Platform
How can we effectively model, analyze, and comprehend user interactions and various attributes within a social media platform based on post-comment relationship? In this study, we propose a novel graph-based approach to model and analyze user interactions within a social media platform based on post-comment relationship. We construct a user interaction graph from social media data and analyze it to gain insights into community dynamics, user behavior, and content preferences. Our investigation reveals that while 56.05% of the active users are strongly connected within the community, only 0.8% of them significantly contribute to its dynamics. Moreover, we observe temporal variations in community activity, with certain periods experiencing heightened engagement. Additionally, our findings highlight a correlation between user activity and popularity showing that more active users are generally more popular. Alongside these, a preference for positive and informative content is also observed where 82.41% users preferred positive and informative content. Overall, our study provides a comprehensive framework for understanding and managing online communities, leveraging graph-based techniques to gain valuable insights into user behavior and community dynamics.
comment: 9 Pages, 8 Figures, 3 Tables
☆ Towards Human-Like Machine Comprehension: Few-Shot Relational Learning in Visually-Rich Documents COLING2024
Key-value relations are prevalent in Visually-Rich Documents (VRDs), often depicted in distinct spatial regions accompanied by specific color and font styles. These non-textual cues serve as important indicators that greatly enhance human comprehension and acquisition of such relation triplets. However, current document AI approaches often fail to consider this valuable prior information related to visual and spatial features, resulting in suboptimal performance, particularly when dealing with limited examples. To address this limitation, our research focuses on few-shot relational learning, specifically targeting the extraction of key-value relation triplets in VRDs. Given the absence of a suitable dataset for this task, we introduce two new few-shot benchmarks built upon existing supervised benchmark datasets. Furthermore, we propose a variational approach that incorporates relational 2D-spatial priors and prototypical rectification techniques. This approach aims to generate relation representations that are more aware of the spatial context and unseen relation in a manner similar to human perception. Experimental results demonstrate the effectiveness of our proposed method by showcasing its ability to outperform existing methods. This study also opens up new possibilities for practical applications.
comment: 13 pages, 7 figures, accepted by LERC-COLING2024
☆ User-Side Realization
Users are dissatisfied with services. Since the service is not tailor-made for a user, it is natural for dissatisfaction to arise. The problem is, that even if users are dissatisfied, they often do not have the means to resolve their dissatisfaction. The user cannot alter the source code of the service, nor can they force the service provider to change. The user has no choice but to remain dissatisfied or quit the service. User-side realization offers proactive solutions to this problem by providing general algorithms to deal with common problems on the user's side. These algorithms run on the user's side and solve the problems without having the service provider change the service itself.
comment: Doctoral Thesis
☆ QueryExplorer: An Interactive Query Generation Assistant for Search and Exploration NAACL 2024
Formulating effective search queries remains a challenging task, particularly when users lack expertise in a specific domain or are not proficient in the language of the content. Providing example documents of interest might be easier for a user. However, such query-by-example scenarios are prone to concept drift, and the retrieval effectiveness is highly sensitive to the query generation method, without a clear way to incorporate user feedback. To enable exploration and to support Human-In-The-Loop experiments we propose QueryExplorer -- an interactive query generation, reformulation, and retrieval interface with support for HuggingFace generation models and PyTerrier's retrieval pipelines and datasets, and extensive logging of human feedback. To allow users to create and modify effective queries, our demo supports complementary approaches of using LLMs interactively, assisting the user with edits and feedback at multiple stages of the query formulation process. With support for recording fine-grained interactions and user annotations, QueryExplorer can serve as a valuable experimental and research platform for annotation, qualitative evaluation, and conducting Human-in-the-Loop (HITL) experiments for complex search tasks where users struggle to formulate queries.
comment: NAACL 2024 Demonstration Track
☆ Ghost Sentence: A Tool for Everyday Users to Copyright Data from Large Language Models
Web user data plays a central role in the ecosystem of pre-trained large language models (LLMs) and their fine-tuned variants. Billions of data are crawled from the web and fed to LLMs. How can \textit{\textbf{everyday web users}} confirm if LLMs misuse their data without permission? In this work, we suggest that users repeatedly insert personal passphrases into their documents, enabling LLMs to memorize them. These concealed passphrases in user documents, referred to as \textit{ghost sentences}, once they are identified in the generated content of LLMs, users can be sure that their data is used for training. To explore the effectiveness and usage of this copyrighting tool, we define the \textit{user training data identification} task with ghost sentences. Multiple datasets from various sources at different scales are created and tested with LLMs of different sizes. For evaluation, we introduce a last $k$ words verification manner along with two metrics: document and user identification accuracy. In the specific case of instruction tuning of a 3B LLaMA model, 11 out of 16 users with ghost sentences identify their data within the generation content. These 16 users contribute 383 examples to $\sim$1.8M training documents. For continuing pre-training of a 1.1B TinyLlama model, 61 out of 64 users with ghost sentences identify their data within the LLM output. These 64 users contribute 1156 examples to $\sim$10M training documents.
comment: Preprint, work in progress
♻ ☆ Large Language Models for Generative Recommendation: A Survey and Visionary Discussions LREC
Large language models (LLM) not only have revolutionized the field of natural language processing (NLP) but also have the potential to reshape many other fields, e.g., recommender systems (RS). However, most of the related work treats an LLM as a component of the conventional recommendation pipeline (e.g., as a feature extractor), which may not be able to fully leverage the generative power of LLM. Instead of separating the recommendation process into multiple stages, such as score computation and re-ranking, this process can be simplified to one stage with LLM: directly generating recommendations from the complete pool of items. This survey reviews the progress, methods, and future directions of LLM-based generative recommendation by examining three questions: 1) What generative recommendation is, 2) Why RS should advance to generative recommendation, and 3) How to implement LLM-based generative recommendation for various RS tasks. We hope that this survey can provide the context and guidance needed to explore this interesting and emerging topic.
comment: Published as a conference paper at LREC-COLING 2024
♻ ☆ No more optimization rules: LLM-enabled policy-based multi-modal query optimizer
Large language model (LLM) has marked a pivotal moment in the field of machine learning and deep learning. Recently its capability for query planning has been investigated, including both single-modal and multi-modal queries. However, there is no work on the query optimization capability of LLM. As a critical (or could even be the most important) step that significantly impacts the execution performance of the query plan, such analysis and attempts should not be missed. From another aspect, existing query optimizers are usually rule-based or rule-based + cost-based, i.e., they are dependent on manually created rules to complete the query plan rewrite/transformation. Given the fact that modern optimizers include hundreds to thousands of rules, designing a multi-modal query optimizer following a similar way is significantly time-consuming since we will have to enumerate as many multi-modal optimization rules as possible, which has not been well addressed today. In this paper, we investigate the query optimization ability of LLM and use LLM to design LaPuda, a novel LLM and Policy based multi-modal query optimizer. Instead of enumerating specific and detailed rules, LaPuda only needs a few abstract policies to guide LLM in the optimization, by which much time and human effort are saved. Furthermore, to prevent LLM from making mistakes or negative optimization, we borrow the idea of gradient descent and propose a guided cost descent (GCD) algorithm to perform the optimization, such that the optimization can be kept in the correct direction. In our evaluation, our methods consistently outperform the baselines in most cases. For example, the optimized plans generated by our methods result in 1~3x higher execution speed than those by the baselines.
comment: Yifan and Haodi contribute equally to the work
♻ ☆ In-context Learning with Retrieved Demonstrations for Language Models: A Survey
Language models, especially pre-trained large language models, have showcased remarkable abilities as few-shot in-context learners (ICL), adept at adapting to new tasks with just a few demonstrations in the input context. However, the model's ability to perform ICL is sensitive to the choice of the few-shot demonstrations. Instead of using a fixed set of demonstrations, one recent development is to retrieve demonstrations tailored to each input query. The implementation of demonstration retrieval is relatively straightforward, leveraging existing databases and retrieval systems. This not only improves the efficiency and scalability of the learning process but also has been shown to reduce biases inherent in manual example selection. In light of the encouraging results and growing research in ICL with retrieved demonstrations, we conduct an extensive review of studies in this area. In this survey, we discuss and compare different design choices for retrieval models, retrieval training procedures, and inference algorithms.
Data Structures and Algorithms
☆ SAT Encoding of Partial Ordering Models for Graph Coloring Problems
In this paper, we suggest new SAT encodings of the partial-ordering based ILP model for the graph coloring problem (GCP) and the bandwidth coloring problem (BCP). The GCP asks for the minimum number of colors that can be assigned to the vertices of a given graph such that each two adjacent vertices get different colors. The BCP is a generalization, where each edge has a weight that enforces a minimal "distance" between the assigned colors, and the goal is to minimize the "largest" color used. For the widely studied GCP, we experimentally compare our new SAT encoding to the state-of-the-art approaches on the DIMACS benchmark set. Our evaluation confirms that this SAT encoding is effective for sparse graphs and even outperforms the state-of-the-art on some DIMACS instances. For the BCP, our theoretical analysis shows that the partial-ordering based SAT and ILP formulations have an asymptotically smaller size than that of the classical assignment-based model. Our practical evaluation confirms not only a dominance compared to the assignment-based encodings but also to the state-of-the-art approaches on a set of benchmark instances. Up to our knowledge, we have solved several open instances of the BCP from the literature for the first time.
☆ On the complexity and approximability of Bounded access Lempel Ziv coding
We study the complexity of constructing an optimal parsing $\varphi$ of a string ${\bf s} = s_1 \dots s_n$ under the constraint that given a position $p$ in the original text, and the LZ76-like (Lempel Ziv 76) encoding of $T$ based on $\varphi$, it is possible to identify/decompress the character $s_p$ by performing at most $c$ accesses to the LZ encoding, for a given integer $c.$ We refer to such a parsing $\varphi$ as a $c$-bounded access LZ parsing or $c$-BLZ parsing of ${\bf s}.$ We show that for any constant $c$ the problem of computing the optimal $c$-BLZ parsing of a string, i.e., the one with the minimum number of phrases, is NP-hard and also APX hard, i.e., no PTAS can exist under the standard complexity assumption $P \neq NP.$ We also study the ratio between the sizes of an optimal $c$-BLZ parsing of a string ${\bf s}$ and an optimal LZ76 parsing of ${\bf s}$ (which can be greedily computed in polynomial time).
☆ Distance Adjustment of a Graph Drawing Stress Model
Stress models are a promising approach for graph drawing. They minimize the weighted sum of the squared errors of the Euclidean and desired distances for each node pair. The desired distance typically uses the graph-theoretic distances obtained from the all-node pair shortest path problem. In a minimized stress function, the obtained coordinates are affected by the non-Euclidean property and the high-dimensionality of the graph-theoretic distance matrix. Therefore, the graph-theoretic distances used in stress models may not necessarily be the best metric for determining the node coordinates. In this study, we propose two different methods of adjusting the graph-theoretical distance matrix to a distance matrix suitable for graph drawing while preserving its structure. The first method is the application of eigenvalue decomposition to the inner product matrix obtained from the distance matrix and the obtainment of a new distance matrix by setting some eigenvalues with small absolute values to zero. The second approach is the usage of a stress model modified by adding a term that minimizes the Frobenius norm between the adjusted and original distance matrices. We perform computational experiments using several benchmark graphs to demonstrate that the proposed method improves some quality metrics, including the node resolution and the Gabriel graph property, when compared to conventional stress models.
comment: 24 pages, 6 figures
♻ ☆ The landscape of compressibility measures for two-dimensional data
In this paper we extend to two-dimensional data two recently introduced one-dimensional compressibility measures: the $\gamma$ measure defined in terms of the smallest string attractor, and the $\delta$ measure defined in terms of the number of distinct substrings of the input string. Concretely, we introduce the two-dimensional measures $\gamma_{2D}$ and $\delta_{2D}$ as natural generalizations of $\gamma$ and $\delta$ and study some of their properties. Among other things, we prove that $\delta_{2D}$ is monotone and can be computed in linear time, and we show that, although it is still true that $\delta_{2D} \leq \gamma_{2D}$, the gap between the two measures can be $\Omega(\sqrt{n})$ for families of $n\times n$ matrices and therefore asymptotically larger than the gap in one-dimension. To complete the scenario of two-dimensional compressibility measures, we introduce also the measure $b_{2D}$ which generalizes to two dimensions the notion of optimal parsing. We prove that, somewhat surprisingly, the relationship between $b_{2D}$ and $\gamma_{2D}$ is significantly different than in the one-dimensional case. As an application of the measures $\gamma_{2D}$ and $\delta_{2D}$ we provide the first analysis of the space usage of the two-dimensional block tree introduced in [Brisaboa et al., Two-dimensional block trees, The computer Journal, 2023]. Finally, we present a linear time algorithm for constructing the two-dimensional block tree for arbitrary matrices, that is asymptotically faster than the (probabilistic) known solution which can only be used for binary matrices.
Signal Processing
☆ Energy Efficient Design of Active STAR-RIS-Aided SWIPT Systems
In this paper, we consider the downlink transmission of a multi-antenna base station (BS) supported by an active simultaneously transmitting and reconfigurable intelligent surface (STAR-RIS) to serve single-antenna users via simultaneous wireless information and power transfer (SWIPT). In this context, we formulate an energy efficiency maximisation problem that jointly optimises the gain, element selection and phase shift matrices of the active STAR-RIS, the transmit beamforming of the BS and the power splitting ratio of the users. With respect to the highly coupled and non-convex form of this problem, an alternating optimisation solution approach is proposed, using tools from convex optimisation and reinforcement learning. Specifically, semi-definite relaxation (SDR), difference of concave functions (DC), and fractional programming techniques are employed to transform the non-convex optimisation problem into a convex form for optimising the BS beamforming vector and the power splitting ratio of the SWIPT. Then, by integrating meta-learning with the modified deep deterministic policy gradient (DDPG) and soft actor-critical (SAC) methods, a combinatorial reinforcement learning network is developed to optimise the element selection, gain and phase shift matrices of the active STAR-RIS. Our simulations show the effectiveness of the proposed resource allocation scheme. Furthermore, our proposed active STAR-RIS-based SWIPT system outperforms its passive counterpart by 57% on average.
☆ Towards Channel-Resilient CSI-Based RF Fingerprinting using Deep Learning INFOCOM
This work introduces DeepCRF, a deep learning framework designed for channel state information-based radio frequency fingerprinting (CSI-RFF). The considered CSI-RFF is built on micro-CSI, a recently discovered radio-frequency (RF) fingerprint that manifests as micro-signals appearing on the channel state information (CSI) curves of commercial WiFi devices. Micro-CSI facilitates CSI-RFF which is more streamlined and easily implementable compared to existing schemes that rely on raw I/Q samples. The primary challenge resides in the precise extraction of micro-CSI from the inherently fluctuating CSI measurements, a process critical for reliable RFF. The construction of a framework that is resilient to channel variability is essential for the practical deployment of CSI-RFF techniques. DeepCRF addresses this challenge with a thoughtfully trained convolutional neural network (CNN). This network's performance is significantly enhanced by employing effective and strategic data augmentation techniques, which bolster its ability to generalize to novel, unseen channel conditions. Furthermore, DeepCRF incorporates supervised contrastive learning to enhance its robustness against noises. Our evaluations demonstrate that DeepCRF significantly enhances the accuracy of device identification across previously unencountered channels. It outperforms both the conventional model-based methods and standard CNN that lack our specialized training and enhancement strategies.
comment: 6 pages,5 figures, INFOCOM WKSHPS 2024
☆ Block Orthogonal Sparse Superposition Codes for $ \sf{L}^3 $ Communications: Low Error Rate, Low Latency, and Low Power Consumption
Block orthogonal sparse superposition (BOSS) code is a class of joint coded modulation methods, which can closely achieve the finite-blocklength capacity with a low-complexity decoder at a few coding rates under Gaussian channels. However, for fading channels, the code performance degrades considerably because coded symbols experience different channel fading effects. In this paper, we put forth novel joint demodulation and decoding methods for BOSS codes under fading channels. For a fast fading channel, we present a minimum mean square error approximate maximum a posteriori (MMSE-A-MAP) algorithm for the joint demodulation and decoding when channel state information is available at the receiver (CSIR). We also propose a joint demodulation and decoding method without using CSIR for a block fading channel scenario. We refer to this as the non-coherent sphere decoding (NSD) algorithm. Simulation results demonstrate that BOSS codes with MMSE-A-MAP decoding outperform CRC-aided polar codes, while NSD decoding achieves comparable performance to quasi-maximum likelihood decoding with significantly reduced complexity. Both decoding algorithms are suitable for parallelization, satisfying low-latency constraints. Additionally, real-time simulations on a software-defined radio testbed validate the feasibility of using BOSS codes for low-power transmission.
♻ ☆ Enhancing Physical Layer Security in Dual-Function Radar-Communication Systems with Hybrid Beamforming Architecture
In this letter, we investigate enhancing the physical layer security (PLS) for the dual-function radar-communication (DFRC) system with hybrid beamforming (HBF) architecture, where the base station (BS) achieves downlink communication and radar target detection simultaneously. We consider an eavesdropper intercepting the information transmitted from the BS to the downlink communication users with imperfectly known channel state information. Additionally, the location of the radar target is also imperfectly known by the BS. To enhance PLS in the considered DFRC system, we propose a novel HBF architecture, which introduces a new integrated sensing and security (I2S) symbol. The secure HBF design problem for DFRC is formulated by maximizing the minimum legitimate user communication rate subject to radar signal-to-interference-plus-noise ratio, eavesdropping rate, hardware and power constraints. To solve this non-convex problem, we propose an alternating optimization based method to jointly optimize transmit and receive beamformers. Numerical simulation results validate the effectiveness of the proposed algorithm and show the superiority of the proposed I2S-aided HBF architecture for achieving DFRC and enhancing PLS.
♻ ☆ Arbitrary Discrete Fourier Analysis and Its Application in Replayed Speech Detection
In this paper, a group of finite sequences and its variants were proposed to use in conducting signal analysis; we called the developed signal analysis methods arbitrary discrete Fourier analysis (ADFA), Mel-scale discrete Fourier analysis (MDFA) and constant Q analysis (CQA). The effectiveness of three signal analysis methods were then validated by testing their performance on a replayed speech detection benchmark (i.e., the ASVspoof 2019 Physical Access) along with a state-of-the-art model. Comparable performance to the best reported systems were shown by the experimental results with three signal analysis methods. Furthermore, the CQA method shown its efficiency with less computation time in compared to the convention method constant Q transform (CQT), which is commonly used in spoofed and fake speech detection and music processing.
comment: https://github.com/shihkuanglee/ADFA
♻ ☆ Theoretical Analysis of the Radio Map Estimation Problem
Radio maps provide radio frequency metrics, such as the received signal strength, at every location of a geographic area. These maps, which are estimated using a set of measurements collected at multiple positions, find a wide range of applications in wireless communications, including the prediction of coverage holes, network planning, resource allocation, and path planning for mobile robots. Although a vast number of estimators have been proposed, the theoretical understanding of the radio map estimation (RME) problem has not been addressed. The present work aims at filling this gap along two directions. First, the complexity of the set of radio map functions is quantified by means of lower and upper bounds on their spatial variability, which offers valuable insight into the required spatial distribution of measurements and the estimators that can be used. Second, the reconstruction error for power maps in free space is upper bounded for three conventional spatial interpolators. The proximity coefficient, which is a decreasing function of the distance from the transmitters to the mapped region, is proposed to quantify the complexity of the RME problem. Numerical experiments assess the tightness of the obtained bounds and the validity of the main takeaways in complex environments.
♻ ☆ A learning-based solution approach to the application placement problem in mobile edge computing under uncertainty
Placing applications in mobile edge computing servers presents a complex challenge involving many servers, users, and their requests. Existing algorithms take a long time to solve high-dimensional problems with significant uncertainty scenarios. Therefore, an efficient approach is required to maximize the quality of service while considering all technical constraints. One of these approaches is machine learning, which emulates optimal solutions for application placement in edge servers. Machine learning models are expected to learn how to allocate user requests to servers based on the spatial positions of users and servers. In this study, the problem is formulated as a two-stage stochastic programming. A sufficient amount of training records is generated by varying parameters such as user locations, their request rates, and solving the optimization model. Then, based on the distance features of each user from the available servers and their request rates, machine learning models generate decision variables for the first stage of the stochastic optimization model, which is the user-to-server request allocation, and are employed as independent decision agents that reliably mimic the optimization model. Support Vector Machines (SVM) and Multi-layer Perceptron (MLP) are used in this research to achieve practical decisions from the stochastic optimization models. The performance of each model has shown an execution effectiveness of over 80%. This research aims to provide a more efficient approach for tackling high-dimensional problems and scenarios with uncertainties in mobile edge computing by leveraging machine learning models for optimal decision-making in request allocation to edge servers. These results suggest that machine-learning models can significantly improve solution times compared to conventional approaches.
♻ ☆ Modeling Sparse Graph Sequences and Signals Using Generalized Graphons
Graphons are limit objects of sequences of graphs and are used to analyze the behavior of large graphs. Recently, graphon signal processing has been developed to study signal processing on large graphs. A major limitation of this approach is that any sparse sequence of graphs inevitably converges to the zero graphon, rendering the resulting signal processing theory trivial and inadequate for sparse graph sequences. To overcome this limitation, we propose a new signal processing framework that leverages the concept of generalized graphons and introduces the stretched cut distance as a measure to compare these graphons. Our framework focuses on the sampling of graph sequences from generalized graphons and explores the convergence properties of associated operators, spectra, and signals. Our signal processing framework provides a comprehensive approach to analyzing and processing signals on graph sequences, even if they are sparse. Finally, we discuss the practical implications of our theory for real-world large networks through numerical experiments.
Sound
♻ ☆ Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks ICASSP2024
Brain-inspired spiking neural networks (SNNs) have demonstrated great potential for temporal signal processing. However, their performance in speech processing remains limited due to the lack of an effective auditory front-end. To address this limitation, we introduce Spiking-LEAF, a learnable auditory front-end meticulously designed for SNN-based speech processing. Spiking-LEAF combines a learnable filter bank with a novel two-compartment spiking neuron model called IHC-LIF. The IHC-LIF neurons draw inspiration from the structure of inner hair cells (IHC) and they leverage segregated dendritic and somatic compartments to effectively capture multi-scale temporal dynamics of speech signals. Additionally, the IHC-LIF neurons incorporate the lateral feedback mechanism along with spike regularization loss to enhance spike encoding efficiency. On keyword spotting and speaker identification tasks, the proposed Spiking-LEAF outperforms both SOTA spiking auditory front-ends and conventional real-valued acoustic features in terms of classification accuracy, noise robustness, and encoding efficiency.
comment: Accepted by ICASSP2024
Robotics
☆ Risk-Calibrated Human-Robot Interaction via Set-Valued Intent Prediction
Tasks where robots must cooperate with humans, such as navigating around a cluttered home or sorting everyday items, are challenging because they exhibit a wide range of valid actions that lead to similar outcomes. Moreover, zero-shot cooperation between human-robot partners is an especially challenging problem because it requires the robot to infer and adapt on the fly to a latent human intent, which could vary significantly from human to human. Recently, deep learned motion prediction models have shown promising results in predicting human intent but are prone to being confidently incorrect. In this work, we present Risk-Calibrated Interactive Planning (RCIP), which is a framework for measuring and calibrating risk associated with uncertain action selection in human-robot cooperation, with the fundamental idea that the robot should ask for human clarification when the risk associated with the uncertainty in the human's intent cannot be controlled. RCIP builds on the theory of set-valued risk calibration to provide a finite-sample statistical guarantee on the cumulative loss incurred by the robot while minimizing the cost of human clarification in complex multi-step settings. Our main insight is to frame the risk control problem as a sequence-level multi-hypothesis testing problem, allowing efficient calibration using a low-dimensional parameter that controls a pre-trained risk-aware policy. Experiments across a variety of simulated and real-world environments demonstrate RCIP's ability to predict and adapt to a diverse set of dynamic human intents.
comment: Website with additional information, videos, and code: https://risk-calibrated-planning.github.io/
☆ Explore until Confident: Efficient Exploration for Embodied Question Answering
We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. However, there are two main challenges when using VLMs in EQA: they do not have an internal memory for mapping the scene to be able to plan how to explore over time, and their confidence can be miscalibrated and can cause the robot to prematurely stop exploration or over-explore. We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM - leveraging its vast knowledge of relevant regions of the scene for exploration. Next, we use conformal prediction to calibrate the VLM's question answering confidence, allowing the robot to know when to stop exploration - leading to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real robot experiments show our proposed approach improves the performance and efficiency over baselines that do no leverage VLM for exploration or do not calibrate its confidence. Webpage with experiment videos and code: https://explore-eqa.github.io/
comment: Under review
☆ iA$^*$: Imperative Learning-based A$^*$ Search for Pathfinding
The pathfinding problem, which aims to identify a collision-free path between two points, is crucial for many applications, such as robot navigation and autonomous driving. Classic methods, such as A$^*$ search, perform well on small-scale maps but face difficulties scaling up. Conversely, data-driven approaches can improve pathfinding efficiency but require extensive data labeling and lack theoretical guarantees, making it challenging for practical applications. To combine the strengths of the two methods, we utilize the imperative learning (IL) strategy and propose a novel self-supervised pathfinding framework, termed imperative learning-based A$^*$ (iA$^*$). Specifically, iA$^*$ is a bilevel optimization process where the lower-level optimization is dedicated to finding the optimal path by a differentiable A$^*$ search module, and the upper-level optimization narrows down the search space to improve efficiency via setting suitable initial values from a data-driven model. Besides, the model within the upper-level optimization is a fully convolutional network, trained by the calculated loss in the lower-level optimization. Thus, the framework avoids extensive data labeling and can be applied in diverse environments. Our comprehensive experiments demonstrate that iA$^*$ surpasses both classical and data-driven methods in pathfinding efficiency and shows superior robustness among different tasks, validated with public datasets and simulation environments.
☆ Automated System-level Testing of Unmanned Aerial Systems
Unmanned aerial systems (UAS) rely on various avionics systems that are safety-critical and mission-critical. A major requirement of international safety standards is to perform rigorous system-level testing of avionics software systems. The current industrial practice is to manually create test scenarios, manually/automatically execute these scenarios using simulators, and manually evaluate outcomes. The test scenarios typically consist of setting certain flight or environment conditions and testing the system under test in these settings. The state-of-the-art approaches for this purpose also require manual test scenario development and evaluation. In this paper, we propose a novel approach to automate the system-level testing of the UAS. The proposed approach (AITester) utilizes model-based testing and artificial intelligence (AI) techniques to automatically generate, execute, and evaluate various test scenarios. The test scenarios are generated on the fly, i.e., during test execution based on the environmental context at runtime. The approach is supported by a toolset. We empirically evaluate the proposed approach on two core components of UAS, an autopilot system of an unmanned aerial vehicle (UAV) and cockpit display systems (CDS) of the ground control station (GCS). The results show that the AITester effectively generates test scenarios causing deviations from the expected behavior of the UAV autopilot and reveals potential flaws in the GCS-CDS.
☆ ARO: Large Language Model Supervised Robotics Text2Skill Autonomous Learning
Robotics learning highly relies on human expertise and efforts, such as demonstrations, design of reward functions in reinforcement learning, performance evaluation using human feedback, etc. However, reliance on human assistance can lead to expensive learning costs and make skill learning difficult to scale. In this work, we introduce the Large Language Model Supervised Robotics Text2Skill Autonomous Learning (ARO) framework, which aims to replace human participation in the robot skill learning process with large-scale language models that incorporate reward function design and performance evaluation. We provide evidence that our approach enables fully autonomous robot skill learning, capable of completing partial tasks without human intervention. Furthermore, we also analyze the limitations of this approach in task understanding and optimization stability.
comment: 6 pages, 2 figures
☆ Scaling Learning based Policy Optimization for Temporal Tasks via Dropout
This paper introduces a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear environment. We desire the trained policy to ensure that the agent satisfies specific task objectives, expressed in discrete-time Signal Temporal Logic (DT-STL). One advantage for reformulation of a task via formal frameworks, like DT-STL, is that it permits quantitative satisfaction semantics. In other words, given a trajectory and a DT-STL formula, we can compute the robustness, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We utilize feedback controllers, and we assume a feed forward neural network for learning these feedback controllers. We show how this learning problem is similar to training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the temporal horizon of the agent's task objectives. This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, and na\"{i}ve gradient descent-based strategies to solve long-horizon task objectives thus suffer from the same problems. To tackle this challenge, we introduce a novel gradient approximation algorithm based on the idea of dropout or gradient sampling. We show that, the existing smooth semantics for robustness are inefficient regarding gradient computation when the specification becomes complex. To address this challenge, we propose a new smooth semantics for DT-STL that under-approximates the robustness value and scales well for backpropagation over a complex specification. We show that our control synthesis methodology, can be quite helpful for stochastic gradient descent to converge with less numerical issues, enabling scalable backpropagation over long time horizons and trajectories over high dimensional state spaces.
☆ Learning Early Social Maneuvers for Enhanced Social Navigation ICRA 2024
Socially compliant navigation is an integral part of safety features in Human-Robot Interaction. Traditional approaches to mobile navigation prioritize physical aspects, such as efficiency, but social behaviors gain traction as robots appear more in daily life. Recent techniques to improve the social compliance of navigation often rely on predefined features or reward functions, introducing assumptions about social human behavior. To address this limitation, we propose a novel Learning from Demonstration (LfD) framework for social navigation that exclusively utilizes raw sensory data. Additionally, the proposed system contains mechanisms to consider the future paths of the surrounding pedestrians, acknowledging the temporal aspect of the problem. The final product is expected to reduce the anxiety of people sharing their environment with a mobile robot, helping them trust that the robot is aware of their presence and will not harm them. As the framework is currently being developed, we outline its components, present experimental results, and discuss future work towards realizing this framework.
comment: Submitted to the workshop of Robot Trust for Symbiotic Societies (RTSS) at ICRA 2024 on March 23, 2024
☆ The Impact of Evolutionary Computation on Robotic Design: A Case Study with an Underactuated Hand Exoskeleton ICRA
Robotic exoskeletons can enhance human strength and aid people with physical disabilities. However, designing them to ensure safety and optimal performance presents significant challenges. Developing exoskeletons should incorporate specific optimization algorithms to find the best design. This study investigates the potential of Evolutionary Computation (EC) methods in robotic design optimization, with an underactuated hand exoskeleton (U-HEx) used as a case study. We propose improving the performance and usability of the U-HEx design, which was initially optimized using a naive brute-force approach, by integrating EC techniques such as Genetic Algorithm and Big Bang-Big Crunch Algorithm. Comparative analysis revealed that EC methods consistently yield more precise and optimal solutions than brute force in a significantly shorter time. This allowed us to improve the optimization by increasing the number of variables in the design, which was impossible with naive methods. The results show significant improvements in terms of the torque magnitude the device transfers to the user, enhancing its efficiency. These findings underline the importance of performing proper optimization while designing exoskeletons, as well as providing a significant improvement to this specific robotic design.
comment: 6 pages (+ref), 4 figures, IEEE International Conference on Robotics and Automation (ICRA) 2024
☆ AirCrab: A Hybrid Aerial-Ground Manipulator with An Active Wheel
Inspired by the behavior of birds, we present AirCrab, a hybrid aerial ground manipulator (HAGM) with a single active wheel and a 3-degree of freedom (3-DoF) manipulator. AirCrab leverages a single point of contact with the ground to reduce position drift and improve manipulation accuracy. The single active wheel enables locomotion on narrow surfaces without adding significant weight to the robot. To realize accurate attitude maintenance using propellers on the ground, we design a control allocation method for AirCrab that prioritizes attitude control and dynamically adjusts the thrust input to reduce energy consumption. Experiments verify the effectiveness of the proposed control method and the gain in manipulation accuracy with ground contact. A series of operations to complete the letters 'NTU' demonstrates the capability of the robot to perform challenging hybrid aerial-ground manipulation missions.
☆ Vid2Real HRI: Align video-based HRI study designs with real-world settings
HRI research using autonomous robots in real-world settings can produce results with the highest ecological validity of any study modality, but many difficulties limit such studies' feasibility and effectiveness. We propose Vid2Real HRI, a research framework to maximize real-world insights offered by video-based studies. The Vid2Real HRI framework was used to design an online study using first-person videos of robots as real-world encounter surrogates. The online study ($n = 385$) distinguished the within-subjects effects of four robot behavioral conditions on perceived social intelligence and human willingness to help the robot enter an exterior door. A real-world, between-subjects replication ($n = 26$) using two conditions confirmed the validity of the online study's findings and the sufficiency of the participant recruitment target ($22$) based on a power analysis of online study results. The Vid2Real HRI framework offers HRI researchers a principled way to take advantage of the efficiency of video-based study modalities while generating directly transferable knowledge of real-world HRI. Code and data from the study are provided at https://vid2real.github.io/vid2realHRI
☆ DriveEnv-NeRF: Exploration of A NeRF-Based Autonomous Driving Environment for Real-World Performance Validation
In this study, we introduce the DriveEnv-NeRF framework, which leverages Neural Radiance Fields (NeRF) to enable the validation and faithful forecasting of the efficacy of autonomous driving agents in a targeted real-world scene. Standard simulator-based rendering often fails to accurately reflect real-world performance due to the sim-to-real gap, which represents the disparity between virtual simulations and real-world conditions. To mitigate this gap, we propose a workflow for building a high-fidelity simulation environment of the targeted real-world scene using NeRF. This approach is capable of rendering realistic images from novel viewpoints and constructing 3D meshes for emulating collisions. The validation of these capabilities through the comparison of success rates in both simulated and real environments demonstrates the benefits of using DriveEnv-NeRF as a real-world performance indicator. Furthermore, the DriveEnv-NeRF framework can serve as a training environment for autonomous driving agents under various lighting conditions. This approach enhances the robustness of the agents and reduces performance degradation when deployed to the target real scene, compared to agents fully trained using the standard simulator rendering pipeline.
comment: Project page: https://github.com/muyishen2040/DriveEnvNeRF
☆ RicMonk: A Three-Link Brachiation Robot with Passive Grippers for Energy-Efficient Brachiation
This paper presents the design, analysis, and performance evaluation of RicMonk, a novel three-link brachiation robot equipped with passive hook-shaped grippers. Brachiation, an agile and energy-efficient mode of locomotion observed in primates, has inspired the development of RicMonk to explore versatile locomotion and maneuvers on ladder-like structures. The robot's anatomical resemblance to gibbons and the integration of a tail mechanism for energy injection contribute to its unique capabilities. The paper discusses the use of the Direct Collocation methodology for optimizing trajectories for the robot's dynamic behaviors and stabilization of these trajectories using a Time-varying Linear Quadratic Regulator. With RicMonk we demonstrate bidirectional brachiation, and provide comparative analysis with its predecessor, AcroMonk - a two-link brachiation robot, to demonstrate that the presence of a passive tail helps improve energy efficiency. The system design, controllers, and software implementation are publicly available on GitHub and the video demonstration of the experiments can be viewed YouTube.
comment: Open sourced system design, controllers, software implementation can be found at https://github.com/dfki-ric-underactuated-lab/ricmonk and a video demonstrating the experiments performed with RicMonk can be found at https://www.youtube.com/watch?v=hOuDQI7CD8w
☆ Distributed Robust Learning based Formation Control of Mobile Robots based on Bioinspired Neural Dynamics
This paper addresses the challenges of distributed formation control in multiple mobile robots, introducing a novel approach that enhances real-world practicability. We first introduce a distributed estimator using a variable structure and cascaded design technique, eliminating the need for derivative information to improve the real time performance. Then, a kinematic tracking control method is developed utilizing a bioinspired neural dynamic-based approach aimed at providing smooth control inputs and effectively resolving the speed jump issue. Furthermore, to address the challenges for robots operating with completely unknown dynamics and disturbances, a learning-based robust dynamic controller is developed. This controller provides real time parameter estimates while maintaining its robustness against disturbances. The overall stability of the proposed method is proved with rigorous mathematical analysis. At last, multiple comprehensive simulation studies have shown the advantages and effectiveness of the proposed method.
comment: This paper is accepted by IEEE Transactions on Intelligent Vehicles
☆ PNAS-MOT: Multi-Modal Object Tracking with Pareto Neural Architecture Search
Multiple object tracking is a critical task in autonomous driving. Existing works primarily focus on the heuristic design of neural networks to obtain high accuracy. As tracking accuracy improves, however, neural networks become increasingly complex, posing challenges for their practical application in real driving scenarios due to the high level of latency. In this paper, we explore the use of the neural architecture search (NAS) methods to search for efficient architectures for tracking, aiming for low real-time latency while maintaining relatively high accuracy. Another challenge for object tracking is the unreliability of a single sensor, therefore, we propose a multi-modal framework to improve the robustness. Experiments demonstrate that our algorithm can run on edge devices within lower latency constraints, thus greatly reducing the computational requirements for multi-modal object tracking while keeping lower latency.
comment: IEEE Robotics and Automation Letters 2024. Code is available at https://github.com/PholyPeng/PNAS-MOT
☆ Data-Driven Predictive Control for Robust Exoskeleton Locomotion
Exoskeleton locomotion must be robust while being adaptive to different users with and without payloads. To address these challenges, this work introduces a data-driven predictive control (DDPC) framework to synthesize walking gaits for lower-body exoskeletons, employing Hankel matrices and a state transition matrix for its data-driven model. The proposed approach leverages DDPC through a multi-layer architecture. At the top layer, DDPC serves as a planner employing Hankel matrices and a state transition matrix to generate a data-driven model that can learn and adapt to varying users and payloads. At the lower layer, our method incorporates inverse kinematics and passivity-based control to map the planned trajectory from DDPC into the full-order states of the lower-body exoskeleton. We validate the effectiveness of this approach through numerical simulations and hardware experiments conducted on the Atalante lower-body exoskeleton with different payloads. Moreover, we conducted a comparative analysis against the model predictive control (MPC) framework based on the reduced-order linear inverted pendulum (LIP) model. Through this comparison, the paper demonstrates that DDPC enables robust bipedal walking at various velocities while accounting for model uncertainties and unknown perturbations.
♻ ☆ LONER: LiDAR Only Neural Representations for Real-Time SLAM
This paper proposes LONER, the first real-time LiDAR SLAM algorithm that uses a neural implicit scene representation. Existing implicit mapping methods for LiDAR show promising results in large-scale reconstruction, but either require groundtruth poses or run slower than real-time. In contrast, LONER uses LiDAR data to train an MLP to estimate a dense map in real-time, while simultaneously estimating the trajectory of the sensor. To achieve real-time performance, this paper proposes a novel information-theoretic loss function that accounts for the fact that different regions of the map may be learned to varying degrees throughout online training. The proposed method is evaluated qualitatively and quantitatively on two open-source datasets. This evaluation illustrates that the proposed loss function converges faster and leads to more accurate geometry reconstruction than other loss functions used in depth-supervised neural implicit frameworks. Finally, this paper shows that LONER estimates trajectories competitively with state-of-the-art LiDAR SLAM methods, while also producing dense maps competitive with existing real-time implicit mapping methods that use groundtruth poses.
comment: First two authors equally contributed. Webpage: https://umautobots.github.io/loner
♻ ☆ ARTEMIS: AI-driven Robotic Triage Labeling and Emergency Medical Information System
Mass casualty incidents (MCIs) pose a significant challenge to emergency medical services by overwhelming available resources and personnel. Effective victim assessment is the key to minimizing casualties during such a crisis. We introduce ARTEMIS, an AI-driven Robotic Triage Labeling and Emergency Medical Information System, to aid first responders in MCI events. It leverages speech processing, natural language processing, and deep learning to help with acuity classification. This is deployed on a quadruped that performs victim localization and preliminary injury severity assessment. First responders access victim information through a Graphical User Interface that is updated in real-time. To validate our proposed algorithmic triage protocol, we used the Unitree Go1 quadruped. The robot identifies humans, interacts with them, gets vitals and information, and assigns an acuity label. Simulations of an MCI in software and a controlled environment outdoors were conducted. The system achieved a triage-level classification precision of over 74% on average and 99% for the most critical victims, i.e. level 1 acuity, outperforming state-of-the-art deep learning-based triage labeling systems. In this paper, we showcase the potential of human-robot interaction in assisting medical personnel in MCI events.
♻ ☆ NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavour to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometer and depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision-making and instruction following. We train NaVid with 550k navigation samples collected from VLN-CE trajectories, including action-planning and instruction-reasoning samples, along with 665k large-scale web data. Extensive experiments show that NaVid achieves SOTA performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.
♻ ☆ Tactile Estimation of Extrinsic Contact Patch for Stable Placement ICRA2024
Precise perception of contact interactions is essential for fine-grained manipulation skills for robots. In this paper, we present the design of feedback skills for robots that must learn to stack complex-shaped objects on top of each other (see Fig.1). To design such a system, a robot should be able to reason about the stability of placement from very gentle contact interactions. Our results demonstrate that it is possible to infer the stability of object placement based on tactile readings during contact formation between the object and its environment. In particular, we estimate the contact patch between a grasped object and its environment using force and tactile observations to estimate the stability of the object during a contact formation. The contact patch could be used to estimate the stability of the object upon release of the grasp. The proposed method is demonstrated in various pairs of objects that are used in a very popular board game.
comment: Accepted at ICRA2024
♻ ☆ Integration of Large Language Models within Cognitive Architectures for Autonomous Robots IROS 2024
Symbolic reasoning systems have been used in cognitive architectures to provide inference and planning capabilities. However, defining domains and problems has proven difficult and prone to errors. Moreover, Large Language Models (LLMs) have emerged as tools to process natural language for different tasks. In this paper, we propose the use of LLMs to tackle these problems. This way, this paper proposes the integration of LLMs in the ROS 2-integrated cognitive architecture MERLIN2 for autonomous robots. Specifically, we present the design, development and deployment of how to leverage the reasoning capabilities of LLMs inside the deliberative processes of MERLIN2. As a result, the deliberative system is updated from a PDDL-based planner system to a natural language planning system. This proposal is evaluated quantitatively and qualitatively, measuring the impact of incorporating the LLMs in the cognitive architecture. Results show that a classical approach achieves better performance but the proposed solution provides an enhanced interaction through natural language.
comment: 8 pages, 6 figures, 2 tables, Submitted to IROS 2024
♻ ☆ Exploring 3D Human Pose Estimation and Forecasting from the Robot's Perspective: The HARPER Dataset
We introduce HARPER, a novel dataset for 3D body pose estimation and forecast in dyadic interactions between users and Spot, the quadruped robot manufactured by Boston Dynamics. The key-novelty is the focus on the robot's perspective, i.e., on the data captured by the robot's sensors. These make 3D body pose analysis challenging because being close to the ground captures humans only partially. The scenario underlying HARPER includes 15 actions, of which 10 involve physical contact between the robot and users. The Corpus contains not only the recordings of the built-in stereo cameras of Spot, but also those of a 6-camera OptiTrack system (all recordings are synchronized). This leads to ground-truth skeletal representations with a precision lower than a millimeter. In addition, the Corpus includes reproducible benchmarks on 3D Human Pose Estimation, Human Pose Forecasting, and Collision Prediction, all based on publicly available baseline approaches. This enables future HARPER users to rigorously compare their results with those we provide in this work.
♻ ☆ Space Filling Curves for Coverage Path Planning with Online Obstacle Avoidance
The paper presents a strategy for robotic exploration problem using Space-Filling curves (SFC). The strategy plans a path that avoids unknown obstacles while ensuring complete coverage of the free space in region of interest. The region of interest is first tessellated, and the tiles/cells are connected using a SFC pattern. A robot follows the SFC to explore the entire area. However, obstacles can block the systematic movement of the robot. We overcome this problem by determining an alternate path online that avoids the blocked cells while ensuring all the accessible cells are visited at least once. The proposed strategy chooses next waypoint based on the graph connectivity of the cells and the obstacle encountered so far. It is online, exhaustive and works in situations demanding non-uniform coverage. The completeness of the strategy is proved and its desirable properties are discussed with examples.
♻ ☆ Fully Spiking Neural Network for Legged Robots
In recent years, legged robots based on deep reinforcement learning have made remarkable progress. Quadruped robots have demonstrated the ability to complete challenging tasks in complex environments and have been deployed in real-world scenarios to assist humans. Simultaneously, bipedal and humanoid robots have achieved breakthroughs in various demanding tasks. Current reinforcement learning methods can utilize diverse robot bodies and historical information to perform actions. However, prior research has not emphasized the speed and energy consumption of network inference, as well as the biological significance of the neural networks themselves. Most of the networks employed are traditional artificial neural networks that utilize multilayer perceptrons (MLP). In this paper, we successfully apply a novel Spiking Neural Network (SNN) to process legged robots, achieving outstanding results across a range of simulated terrains. SNN holds a natural advantage over traditional neural networks in terms of inference speed and energy consumption, and their pulse-form processing of body perception signals offers improved biological interpretability. Applying more biomimetic neural networks to legged robots can further reduce the heat dissipation and structural burden caused by the high power consumption of neural networks. To the best of our knowledge, this is the first work to implement SNN in legged robots.
♻ ☆ SMART-LLM: Smart Multi-Agent Robot Task Planning using Large Language Models IROS 2024
In this work, we introduce SMART-LLM, an innovative framework designed for embodied multi-robot task planning. SMART-LLM: Smart Multi-Agent Robot Task Planning using Large Language Models (LLMs), harnesses the power of LLMs to convert high-level task instructions provided as input into a multi-robot task plan. It accomplishes this by executing a series of stages, including task decomposition, coalition formation, and task allocation, all guided by programmatic LLM prompts within the few-shot prompting paradigm. We create a benchmark dataset designed for validating the multi-robot task planning problem, encompassing four distinct categories of high-level instructions that vary in task complexity. Our evaluation experiments span both simulation and real-world scenarios, demonstrating that the proposed model can achieve promising results for generating multi-robot task plans. The experimental videos, code, and datasets from the work can be found at https://sites.google.com/view/smart-llm/.
comment: Submitted to IROS 2024
♻ ☆ SurgicalPart-SAM: Part-to-Whole Collaborative Prompting for Surgical Instrument Segmentation
The Segment Anything Model (SAM) exhibits promise in generic object segmentation and offers potential for various applications. Existing methods have applied SAM to surgical instrument segmentation (SIS) by tuning SAM-based frameworks with surgical data. However, they fall short in two crucial aspects: (1) Straightforward model tuning with instrument masks treats each instrument as a single entity, neglecting their complex structures and fine-grained details; and (2) Instrument category-based prompts are not flexible and informative enough to describe instrument structures. To address these problems, in this paper, we investigate text promptable SIS and propose SurgicalPart-SAM (SP-SAM), a novel SAM efficient-tuning approach that explicitly integrates instrument structure knowledge with SAM's generic knowledge, guided by expert knowledge on instrument part compositions. Specifically, we achieve this by proposing (1) Collaborative Prompts that describe instrument structures via collaborating category-level and part-level texts; (2) Cross-Modal Prompt Encoder that encodes text prompts jointly with visual embeddings into discriminative part-level representations; and (3) Part-to-Whole Adaptive Fusion and Hierarchical Decoding that adaptively fuse the part-level representations into a whole for accurate instrument segmentation in surgical scenarios. Built upon them, SP-SAM acquires a better capability to comprehend surgical instruments in terms of both overall structure and part-level details. Extensive experiments on both the EndoVis2018 and EndoVis2017 datasets demonstrate SP-SAM's state-of-the-art performance with minimal tunable parameters. The code will be available at https://github.com/wenxi-yue/SurgicalPart-SAM.
comment: Technical Report. The source code will be released at https://github.com/wenxi-yue/SurgicalPart-SAM
♻ ☆ Bird's Eye View Based Pretrained World model for Visual Navigation IROS 2024
Sim2Real transfer has gained popularity because it helps transfer from inexpensive simulators to real world. This paper presents a novel system that fuses components in a traditional World Model into a robust system, trained entirely within a simulator, that Zero-Shot transfers to the real world. To facilitate transfer, we use an intermediary representation that is based on \textit{Bird's Eye View (BEV)} images. Thus, our robot learns to navigate in a simulator by first learning to translate from complex \textit{First-Person View (FPV)} based RGB images to BEV representations, then learning to navigate using those representations. Later, when tested in the real world, the robot uses the perception model that translates FPV-based RGB images to embeddings that were learned by the FPV to BEV translator and that can be used by the downstream policy. The incorporation of state-checking modules using \textit{Anchor images} and Mixture Density LSTM not only interpolates uncertain and missing observations but also enhances the robustness of the model in the real-world. We trained the model using data from a Differential drive robot in the CARLA simulator. Our methodology's effectiveness is shown through the deployment of trained models onto a real-world Differential drive robot. Lastly we release a comprehensive codebase, dataset and models for training and deployment (\url{https://sites.google.com/view/value-explicit-pretraining}).
comment: Under Review at the IROS 2024; Accepted at NeurIPS 2023, Robot Learning Workshop
Databases
☆ TablePuppet: A Generic Framework for Relational Federated Learning
Current federated learning (FL) approaches view decentralized training data as a single table, divided among participants either horizontally (by rows) or vertically (by columns). However, these approaches are inadequate for handling distributed relational tables across databases. This scenario requires intricate SQL operations like joins and unions to obtain the training data, which is either costly or restricted by privacy concerns. This raises the question: can we directly run FL on distributed relational tables? In this paper, we formalize this problem as relational federated learning (RFL). We propose TablePuppet, a generic framework for RFL that decomposes the learning process into two steps: (1) learning over join (LoJ) followed by (2) learning over union (LoU). In a nutshell, LoJ pushes learning down onto the vertical tables being joined, and LoU further pushes learning down onto the horizontal partitions of each vertical table. TablePuppet incorporates computation/communication optimizations to deal with the duplicate tuples introduced by joins, as well as differential privacy (DP) to protect against both feature and label leakages. We demonstrate the efficiency of TablePuppet in combination with two widely-used ML training algorithms, stochastic gradient descent (SGD) and alternating direction method of multipliers (ADMM), and compare their computation/communication complexity. We evaluate the SGD/ADMM algorithms developed atop TablePuppet by training diverse ML models. Our experimental results show that TablePuppet achieves model accuracy comparable to the centralized baselines running directly atop the SQL results. Moreover, ADMM takes less communication time than SGD to converge to similar model accuracy.
comment: 14 pages, 8 figures
☆ Efficient Data Access Paths for Mixed Vector-Relational Search
The rapid growth of machine learning capabilities and the adoption of data processing methods using vector embeddings sparked a great interest in creating systems for vector data management. While the predominant approach of vector data management is to use specialized index structures for fast search over the entirety of the vector embeddings, once combined with other (meta)data, the search queries can also become selective on relational attributes - typical for analytical queries. As using vector indexes differs from traditional relational data access, we revisit and analyze alternative access paths for efficient mixed vector-relational search. We first evaluate the accurate but exhaustive scan-based search and propose hardware optimizations and alternative tensor-based formulation and batching to offset the cost. We outline the complex access-path design space, primarily driven by relational selectivity, and the decisions to consider when selecting an exhaustive scan-based search against an approximate index-based approach. Since the vector index primarily avoids expensive computation across the entire dataset, contrary to the common relational knowledge, it is better to scan at lower selectivity and probe at higher, with a cross-point between the two approaches dictated by data dimensionality and the number of concurrent search queries.
♻ ☆ No more optimization rules: LLM-enabled policy-based multi-modal query optimizer
Large language model (LLM) has marked a pivotal moment in the field of machine learning and deep learning. Recently its capability for query planning has been investigated, including both single-modal and multi-modal queries. However, there is no work on the query optimization capability of LLM. As a critical (or could even be the most important) step that significantly impacts the execution performance of the query plan, such analysis and attempts should not be missed. From another aspect, existing query optimizers are usually rule-based or rule-based + cost-based, i.e., they are dependent on manually created rules to complete the query plan rewrite/transformation. Given the fact that modern optimizers include hundreds to thousands of rules, designing a multi-modal query optimizer following a similar way is significantly time-consuming since we will have to enumerate as many multi-modal optimization rules as possible, which has not been well addressed today. In this paper, we investigate the query optimization ability of LLM and use LLM to design LaPuda, a novel LLM and Policy based multi-modal query optimizer. Instead of enumerating specific and detailed rules, LaPuda only needs a few abstract policies to guide LLM in the optimization, by which much time and human effort are saved. Furthermore, to prevent LLM from making mistakes or negative optimization, we borrow the idea of gradient descent and propose a guided cost descent (GCD) algorithm to perform the optimization, such that the optimization can be kept in the correct direction. In our evaluation, our methods consistently outperform the baselines in most cases. For example, the optimized plans generated by our methods result in 1~3x higher execution speed than those by the baselines.
comment: Yifan and Haodi contribute equally to the work
Systems and Control
☆ Risk-Calibrated Human-Robot Interaction via Set-Valued Intent Prediction
Tasks where robots must cooperate with humans, such as navigating around a cluttered home or sorting everyday items, are challenging because they exhibit a wide range of valid actions that lead to similar outcomes. Moreover, zero-shot cooperation between human-robot partners is an especially challenging problem because it requires the robot to infer and adapt on the fly to a latent human intent, which could vary significantly from human to human. Recently, deep learned motion prediction models have shown promising results in predicting human intent but are prone to being confidently incorrect. In this work, we present Risk-Calibrated Interactive Planning (RCIP), which is a framework for measuring and calibrating risk associated with uncertain action selection in human-robot cooperation, with the fundamental idea that the robot should ask for human clarification when the risk associated with the uncertainty in the human's intent cannot be controlled. RCIP builds on the theory of set-valued risk calibration to provide a finite-sample statistical guarantee on the cumulative loss incurred by the robot while minimizing the cost of human clarification in complex multi-step settings. Our main insight is to frame the risk control problem as a sequence-level multi-hypothesis testing problem, allowing efficient calibration using a low-dimensional parameter that controls a pre-trained risk-aware policy. Experiments across a variety of simulated and real-world environments demonstrate RCIP's ability to predict and adapt to a diverse set of dynamic human intents.
comment: Website with additional information, videos, and code: https://risk-calibrated-planning.github.io/
☆ Convection-Enabled Boundary Control of a 2D Channel Flow
We consider the incompressible Navier-Stokes equations in a two-dimensional channel. The tangential and normal velocities are assumed to be periodic in the streamwise (horizontal) direction. Moreover, we consider no-slip boundary conditions on the tangential velocity at the top and bottom walls of the channel, and normal velocity actuation at the top and bottom walls. For an arbitrarily large Reynolds number, we design the boundary control inputs to achieve global exponential stabilization, in the L2 sense, of a chosen parabolic Poiseuille profile. Moreover, we design the control inputs such that they have zero mean, but non-zero cubic mean. The zero-mean property is to ensure that the conservation of mass constraint is verified. The non-zero cubic mean property is the key to exploiting the stabilizing effect of nonlinear convection and achieving global stabilization independently of the size of the Reynolds number. This paper is not only the first work where a closed-form feedback law is proposed for global stabilization of parabolic Poiseuille profiles for arbitrary Reynolds number but is also the first generalization of the Cardano-Lyapunov formula, designed initially to stabilize scalar-valued convective PDEs, to a vector-valued convective PDE with a divergence-free constraint on the state.
comment: Submitted to the 63rd IEEE Conference on Decision and Control (CDC 2024)
☆ Perception and Control of Surfing in Virtual Reality using a 6-DoF Motion Platform
The paper presents a system for simulating surfing in Virtual Reality (VR), emphasizing the recreation of aquatic motions and user-initiated propulsive forces using a 6-Degree of Freedom (DoF) motion platform. We present an algorithmic approach to accurately render surfboard kinematics and interactive paddling dynamics, validated through experimental evaluation with \(N=17\) participants. Results indicate that the system effectively reproduces various acceleration levels, the perception of which is independent of users' body posture. We additionally found that the presence of ocean ripples amplifies the perception of acceleration. This system aims to enhance the realism and interactivity of VR surfing, laying a foundation for future advancements in surf therapy and interactive aquatic VR experiences.
☆ From Raw Data to Safety: Reducing Conservatism by Set Expansion
In response to safety concerns associated with learning-based algorithms, safety filters have been proposed as a modular technique. Generally, these filters heavily rely on the system's model, which is contradictory if they are intended to enhance a data-driven or end-to-end learning solution. This paper extends our previous work, a purely Data-Driven Safety Filter (DDSF) based on Willems' lemma, to an extremely short-sighted and non-conservative solution. Specifically, we propose online and offline sample-based methods to expand the safe set of DDSF and reduce its conservatism. Since this method is defined in an input-output framework, it can systematically handle both unknown and time-delay LTI systems using only one single batch of data. To evaluate its performance, we apply the proposed method to a time-delay system under various settings. The simulation results validate the effectiveness of the set expansion algorithm in generating a notably large input-output safe set, resulting in safety filters that are not conservative, even with an extremely short prediction horizon.
☆ A Modular Safety Filter for Safety-Certified Cyber-Physical Systems
Nowadays, many control systems are networked and embed communication and computation capabilities. Such control architectures are prone to cyber attacks on the cyberinfrastructure. Consequently, there is an impellent need to develop solutions to preserve the plant's safety against potential attacks. To ensure safety, this paper introduces a modular safety filter approach that is effective for a variety of cyber-attack types. This solution can be implemented in combination with existing control and detection algorithms, effectively separating safety from performance. The safety filter does not require information on the reliability of the received command or the feature of the used anomaly detector. It can be implemented in conjunction with high-performance, resilient controllers, to achieve both high performance during normal operation and safety during an attack. As an illustrative example, we have shown the effectiveness of the proposed design considering a multi-agent formation task involving 20 mobile robots. The simulation results testify that the safety filter operates effectively during false data injection and intelligent attacks.
☆ TJCCT: A Two-timescale Approach for UAV-assisted Mobile Edge Computing
Unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) is emerging as a promising paradigm to provide aerial-terrestrial computing services in close proximity to mobile devices (MDs). However, meeting the demands of computation-intensive and delay-sensitive tasks for MDs poses several challenges, including the demand-supply contradiction between MDs and MEC servers, the demand-supply heterogeneity between MDs and MEC servers, the trajectory control requirements on energy efficiency and timeliness, and the different time-scale dynamics of the network. To address these issues, we first present a hierarchical architecture by incorporating terrestrial-aerial computing capabilities and leveraging UAV flexibility. Furthermore, we formulate a joint computing resource allocation, computation offloading, and trajectory control problem to maximize the system utility. Since the problem is a non-convex and NP-hard mixed integer nonlinear programming (MINLP), we propose a two-timescale joint computing resource allocation, computation offloading, and trajectory control (TJCCT) approach for solving the problem. In the short timescale, we propose a price-incentive model for on-demand computing resource allocation and a matching mechanism-based method for computation offloading. In the long timescale, we propose a convex optimization-based method for UAV trajectory control. Besides, we theoretically prove the stability, optimality, and polynomial complexity of TJCCT. Extended simulation results demonstrate that the proposed TJCCT outperforms the comparative algorithms in terms of the system utility, average processing rate, average completion delay, and average completion ratio.
☆ Scaling Learning based Policy Optimization for Temporal Tasks via Dropout
This paper introduces a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear environment. We desire the trained policy to ensure that the agent satisfies specific task objectives, expressed in discrete-time Signal Temporal Logic (DT-STL). One advantage for reformulation of a task via formal frameworks, like DT-STL, is that it permits quantitative satisfaction semantics. In other words, given a trajectory and a DT-STL formula, we can compute the robustness, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We utilize feedback controllers, and we assume a feed forward neural network for learning these feedback controllers. We show how this learning problem is similar to training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the temporal horizon of the agent's task objectives. This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, and na\"{i}ve gradient descent-based strategies to solve long-horizon task objectives thus suffer from the same problems. To tackle this challenge, we introduce a novel gradient approximation algorithm based on the idea of dropout or gradient sampling. We show that, the existing smooth semantics for robustness are inefficient regarding gradient computation when the specification becomes complex. To address this challenge, we propose a new smooth semantics for DT-STL that under-approximates the robustness value and scales well for backpropagation over a complex specification. We show that our control synthesis methodology, can be quite helpful for stochastic gradient descent to converge with less numerical issues, enabling scalable backpropagation over long time horizons and trajectories over high dimensional state spaces.
☆ Semi-on-Demand Hybrid Transit Route Design with Shared Autonomous Mobility Services
This study examines the route design of a semi-on-demand hybrid route directional service in the public transit network, offering on-demand flexible route service in low-density areas and fixed route service in higher-density areas with Shared Autonomous Mobility Service (SAMS). The study develops analytically tractable cost expressions that capture access, waiting, and riding costs for users, and distance-based operating and time-based vehicle costs for operators. Two formulations are presented for strategic and tactical decisions in flexible route portion, fleet size, headway, and vehicle size optimization, enabling the determination of route types between fixed, hybrid, and flexible routes based on demand, cost, and operational parameters. The practical applications and benefits of semi-on-demand feeders are demonstrated with numerical examples and a large-scale case study in the Chicago metropolitan area. Findings reveal scenarios in which flexible route portions serving passengers located further away reduce total costs, particularly user costs. Lower operating costs in lower-demand areas favor more flexible routes, whereas higher demand densities favor more traditional line-based operations. On two studied lines, a current cost forecast favors smaller vehicles with flexible routes, but operating constraints and higher operating costs would favor bigger vehicles with hybrid routes. The study provides an analytical tool to design SAMS as directional services and transit feeders, and tractable continuous approximation formulations for future research in transit network design.
comment: 24 pages, 12 figures, the previous version presented at the 103rd Transportation Research Board Annual Meeting, Washington, D.C
☆ A Fairness-Oriented Reinforcement Learning Approach for the Operation and Control of Shared Micromobility Services
As Machine Learning systems become increasingly popular across diverse application domains, including those with direct human implications, the imperative of equity and algorithmic fairness has risen to prominence in the Artificial Intelligence community. On the other hand, in the context of Shared Micromobility Systems, the exploration of fairness-oriented approaches remains limited. Addressing this gap, we introduce a pioneering investigation into the balance between performance optimization and algorithmic fairness in the operation and control of Shared Micromobility Services. Our study leverages the Q-Learning algorithm in Reinforcement Learning, benefiting from its convergence guarantees to ensure the robustness of our proposed approach. Notably, our methodology stands out for its ability to achieve equitable outcomes, as measured by the Gini index, across different station categories--central, peripheral, and remote. Through strategic rebalancing of vehicle distribution, our approach aims to maximize operator performance while simultaneously upholding fairness principles for users. In addition to theoretical insights, we substantiate our findings with a case study or simulation based on synthetic data, validating the efficacy of our approach. This paper underscores the critical importance of fairness considerations in shaping control strategies for Shared Micromobility Services, offering a pragmatic framework for enhancing equity in urban transportation systems.
comment: 8 pages, 4 figures, submitted to the 63rd Conference on Decision and Control, Dec. 16-19, 2024, Milan, Italy
☆ Small Noise Analysis of Non-Parametric Closed-Loop Identification
We revisit the problem of non-parametric closed-loop identification in frequency domain; we give a brief survey of the literature and provide a small noise analysis of the direct, indirect, and joint input-output methods when two independent experiments with identical excitation are used. The analysis is asymptotic in the noise variance (i.e., as the standard deviation of the innovations $\sigma \to 0$), for a finite data record of length $N$. We highlight the relationship between the estimators accuracy and the loop shape via asymptotic variance expressions given in terms of the sensitivity function. The results are illustrated using a numerical simulation example.
☆ A Comparative Study of Artificial Potential Fields and Safety Filters
In this paper, we have demonstrated that the controllers designed by a classical motion planning tool, namely artificial potential fields (APFs), can be derived from a recently prevalent approach: control barrier function quadratic program (CBF-QP) safety filters. By integrating APF information into the CBF-QP framework, we establish a bridge between these two methodologies. Specifically, this is achieved by employing the attractive potential field as a control Lyapunov function (CLF) to guide the design of the nominal controller, and then the repulsive potential field serves as a reciprocal CBF (RCBF) to define a CBF-QP safety filter. Building on this integration, we extend the design of the CBF-QP safety filter to accommodate a more general class of dynamical models featuring a control-affine structure. This extension yields a special CBF-QP safety filter and a general APF solution suitable for control-affine dynamical models. Through a reach-avoid navigation example, we showcase the efficacy of the developed approaches.
☆ Distributed Robust Learning based Formation Control of Mobile Robots based on Bioinspired Neural Dynamics
This paper addresses the challenges of distributed formation control in multiple mobile robots, introducing a novel approach that enhances real-world practicability. We first introduce a distributed estimator using a variable structure and cascaded design technique, eliminating the need for derivative information to improve the real time performance. Then, a kinematic tracking control method is developed utilizing a bioinspired neural dynamic-based approach aimed at providing smooth control inputs and effectively resolving the speed jump issue. Furthermore, to address the challenges for robots operating with completely unknown dynamics and disturbances, a learning-based robust dynamic controller is developed. This controller provides real time parameter estimates while maintaining its robustness against disturbances. The overall stability of the proposed method is proved with rigorous mathematical analysis. At last, multiple comprehensive simulation studies have shown the advantages and effectiveness of the proposed method.
comment: This paper is accepted by IEEE Transactions on Intelligent Vehicles
☆ Causal Tracking of Distributions in Wasserstein Space: A Model Predictive Control Scheme
We consider the problem of optimal swarm tracking which can be formulated as a tracking problem for distributions in the Wasserstein metric. Optimal control solutions to this problem are non-causal and require knowing the time-trajectory of the distribution to be tracked in advance. We propose a scheme where these non-causal solutions can be used together with a predictive model for the reference to achieve causal tracking control of a priori-unknown references. We develop the resulting model-predictive control scheme in the simple case where the reference is predicted to be constant-in-time. A computational algorithm based on particle methods and discrete optimal mass transport is presented, and numerical simulations are provided for various classes of reference signals. The results demonstrate that the proposed control algorithm achieves reasonable performance even when using simple predictive models.
comment: 8 pages, 7 figures
☆ Improved Soft-k-Means Clustering Algorithm for Balancing Energy Consumption in Wireless Sensor Networks
Energy load balancing is an essential issue in designing wireless sensor networks (WSNs). Clustering techniques are utilized as energy-efficient methods to balance the network energy and prolong its lifetime. In this paper, we propose an improved soft-k-means (IS-k-means) clustering algorithm to balance the energy consumption of nodes in WSNs. First, we use the idea of ``clustering by fast search and find of density peaks'' (CFSFDP) and kernel density estimation (KDE) to improve the selection of the initial cluster centers of the soft k-means clustering algorithm. Then, we utilize the flexibility of the soft-k-means and reassign member nodes considering their membership probabilities at the boundary of clusters to balance the number of nodes per cluster. Furthermore, the concept of multi-cluster heads is employed to balance the energy consumption within clusters. {Extensive simulation results under different network scenarios demonstrate that for small-scale WSNs with single-hop transmission}, the proposed algorithm can postpone the first node death, the half of nodes death, and the last node death on average when compared to various clustering algorithms from the literature.
☆ Passivity-based Attack Identification and Mitigation with Event-triggered Observer Feedback and Switching Controller
This paper addresses the problem of output consensus in linear passive multi-agent systems under a False Data Injection (FDI) attack, considering the unavailability of complete state information. Our formulation relies on an event-based cryptographic authentication scheme for sensor integrity and considers FDI attacks at the actuator end, inspired by their practical nature and usages. For secure consensus, we propose (i) a passivity-based approach for detecting FDI attacks on the system and (ii) a Zeno-free event-triggered observer-based switching controller, which switches between the normal and the defense modes following an attack detection. We show that the closed-loop system achieves practical consensus under the controller's action in the defense mode. Simulation examples are provided to support the theoretical findings.
☆ Motion Planning for Identification of Linear Classifiers
A given region in 2-D Euclidean space is divided by a unknown linear classifier in to two sets each carrying a label. The objective of an agent with known dynamics traversing the region is to identify the true classifier while paying a control cost across its trajectory. We consider two scenarios: (i) the agent is able to measure the true label perfectly; (ii) the observed label is the true label multiplied by noise. We present the following: (i) the classifier identification problem formulated as a control problem; (ii) geometric interpretation of the control problem resulting in one step modified control problems; (iii) control algorithms that result in data sets which are used to identify the true classifier with accuracy; (iv) convergence of estimated classifier to the true classifier when the observed label is not corrupted by noise; (iv) numerical example demonstrating the utility of the control algorithms.
☆ On the role of network structure in learning to coordinate with bounded rationality
Many socioeconomic phenomena, such as technology adoption, collaborative problem-solving, and content engagement, involve a collection of agents coordinating to take a common action, aligning their decisions to maximize their individual goals. We consider a model for networked interactions where agents learn to coordinate their binary actions under a strict bound on their rationality. We first prove that our model is a potential game and that the optimal action profile is always to achieve perfect alignment at one of the two possible actions, regardless of the network structure. Using a stochastic learning algorithm known as Log Linear Learning, where agents have the same finite rationality parameter, we show that the probability of agents successfully agreeing on the correct decision is monotonically increasing in the number of network links. Therefore, more connectivity improves the accuracy of collective decision-making, as predicted by the phenomenon known as Wisdom of Crowds. Finally, we show that for a fixed number of links, a regular network maximizes the probability of success. We conclude that when using a network of irrational agents, promoting more homogeneous connectivity improves the accuracy of collective decision-making.
comment: Submitted to 2024 IEEE Conference on Decision and Control
☆ Safe and Stable Formation Control with Distributed Multi-Agents Using Adaptive Control and Control Barrier Functions
This manuscript considers the problem of ensuring stability and safety during formation control with distributed multi-agent systems in the presence of parametric uncertainty in the dynamics and limited communication. We propose an integrative approach that combines Control Barrier Functions, Adaptive Control, and connected graphs. A reference model is designed so as to ensure a safe and stable formation control strategy. This is combined with a provably correct adaptive control design that includes a use of a CBF-based safety filter that suitably generates safe reference commands, and employs error-based relaxation (EBR) of Nagumo's Invariance Theorem. Together, it is shown to lead to a guarantee of boundedness, formation control, and forward invariance. Numerical examples are provided to support the theoretical derivations.
☆ Real-Time Reconfiguration and Connectivity Maintenance for AUVs Network Under External Disturbances using Distributed Nonlinear Model Predictive Control
Advancements in underwater vehicle technology have significantly expanded the potential scope for deploying autonomous or remotely operated underwater vehicles in novel practical applications. However, the efficiency and maneuverability of these vehicles remain critical challenges, particularly in the dynamic aquatic environment. In this work, we propose a novel control scheme for creating multi-agent distributed formation control with limited communication between individual agents. In addition, the formation of the multi-agent can be reconfigured in real-time and the network connectivity can be maintained. The proposed use case for this scheme includes creating underwater mobile communication networks that can adapt to environmental or network conditions to maintain the quality of communication links for long-range exploration, seabed monitoring, or underwater infrastructure inspection. This work introduces a novel Distributed Nonlinear Model Predictive Control (DNMPC) strategy, integrating Control Lyapunov Functions (CLF) and Control Barrier Functions (CBF) with a relaxed decay rate, specifically tailored for 6-DOF underwater robotics. The effectiveness of our proposed DNMPC scheme was demonstrated through rigorous MATLAB simulations for trajectory tracking and formation reconfiguration in a dynamic environment. Our findings, supported by tests conducted using Software In The Loop (SITL) simulation, confirm the approach's applicability in real-time scenarios.
♻ ☆ On the impact of regularization in data-driven predictive control
Model predictive control (MPC) is a control strategy widely used in industrial applications. However, its implementation typically requires a mathematical model of the system being controlled, which can be a time-consuming and expensive task. Data-driven predictive control (DDPC) methods offer an alternative approach that does not require an explicit mathematical model, but instead optimize the control policy directly from data. In this paper, we study the impact of two different regularization penalties on the closed-loop performance of a recently introduced data-driven method called $\gamma$-DDPC. Moreover, we discuss the tuning of the related coefficients in different data and noise scenarios, to provide some guidelines for the end user.
comment: 7 pages, 3 figures
♻ ☆ Time-Varying Soft-Maximum Control Barrier Functions for Safety in an A Priori Unknown Environment
This paper presents a time-varying soft-maximum composite control barrier function (CBF) that can be used to ensure safety in an a priori unknown environment, where local perception information regarding the safe set is periodically obtained. We consider the scenario where the periodically obtained perception feedback can be used to construct a local CBF that models a local subset of the unknown safe set. Then, we use a novel smooth time-varying soft-maximum function to compose the N most recently obtained local CBFs into a single CBF. This composite CBF models an approximate union of the N most recently obtained local subsets of the safe set. Notably, this composite CBF can have arbitrary relative degree r. Next, this composite CBF is used as a rth-order CBF constraint in a real-time optimization to determine a control that minimizes a quadratic cost while guaranteeing that the state stays in a time-varying subset of the unknown safe set. We also present an application of the time-varying soft-maximum composite CBF method to a nonholonomic ground robot with nonnegligible inertia. In this application, we present a simple approach to generate the local CBFs from the periodically obtained perception data.
comment: 2024 American Control Conference (ACC)
♻ ☆ Hybrid Feedback Control for Global and Optimal Safe Navigation
We propose a hybrid feedback control strategy that safely steers a point-mass robot to a target location optimally from all initial conditions in the n-dimensional Euclidean space with a single spherical obstacle. The robot moves straight to the target when it has a clear line-of-sight to the target location. Otherwise, it engages in an optimal obstacle avoidance maneuver via the shortest path inside the cone enclosing the obstacle and having the robot's position as a vertex. The switching strategy that avoids the undesired equilibria, leading to global asymptotic stability (GAS) of the target location, relies on using two appropriately designed virtual destinations, ensuring control continuity and shortest path generation. Simulation results illustrating the effectiveness of the proposed approach are presented.
♻ ☆ Synthesis of Temporally-Robust Policies for Signal Temporal Logic Tasks using Reinforcement Learning ICRA 2024
This paper investigates the problem of designing control policies that satisfy high-level specifications described by signal temporal logic (STL) in unknown, stochastic environments. While many existing works concentrate on optimizing the spatial robustness of a system, our work takes a step further by also considering temporal robustness as a critical metric to quantify the tolerance of time uncertainty in STL. To this end, we formulate two relevant control objectives to enhance the temporal robustness of the synthesized policies. The first objective is to maximize the probability of being temporally robust for a given threshold. The second objective is to maximize the worst-case spatial robustness value within a bounded time shift. We use reinforcement learning to solve both control synthesis problems for unknown systems. Specifically, we approximate both control objectives in a way that enables us to apply the standard Q-learning algorithm. Theoretical bounds in terms of the approximations are also derived. We present case studies to demonstrate the feasibility of our approach.
comment: Accepted to ICRA 2024
♻ ☆ Topology Data Analysis-based Error Detection for Semantic Image Transmission with Incremental Knowledge-based HARQ
Semantic communication (SemCom) aims to achieve high fidelity information delivery under low communication consumption by only guaranteeing semantic accuracy. Nevertheless, semantic communication still suffers from unexpected channel volatility and thus developing a re-transmission mechanism (e.g., hybrid automatic repeat request [HARQ]) is indispensable. In that regard, instead of discarding previously transmitted information, the incremental knowledge-based HARQ (IK-HARQ) is deemed as a more effective mechanism that could sufficiently utilize the information semantics. However, considering the possible existence of semantic ambiguity in image transmission, a simple bit-level cyclic redundancy check (CRC) might compromise the performance of IK-HARQ. Therefore, it emerges a strong incentive to revolutionize the CRC mechanism, so as to reap the benefits of both SemCom and HARQ. In this paper, built on top of swin transformer-based joint source-channel coding (JSCC) and IK-HARQ, we propose a semantic image transmission framework SC-TDA-HARQ. In particular, different from the conventional CRC, we introduce a topological data analysis (TDA)-based error detection method, which capably digs out the inner topological and geometric information of images, so as to capture semantic information and determine the necessity for re-transmission. Extensive numerical results validate the effectiveness and efficiency of the proposed SC-TDA-HARQ framework, especially under the limited bandwidth condition, and manifest the superiority of TDA-based error detection method in image transmission.
♻ ☆ Hybrid path-lifting algorithm and Equivalence of Stability results for MRP-based control strategies
The modified Rodrigues parameters (MRP) consist of two numerically different triplets that, by switching between them, yield a minimal globally non-singular attitude description with advantageous properties. The MRP space results from the Alexandroff compactification of the three-dimensional Euclidean space and is a double cover of $\mathrm{SO(3)}$. By capitalizing on instrumental properties of the covering map, this paper proposes a novel hybrid dynamic path-lifting mechanism to unambiguously and robustly extract the MRP from the attitude space. This hybrid solution allows applying an MRP-based feedback controller to the attitude dynamics in the base space while preserving its asymptotic and exponential stability properties. Furthermore, by profiting from the distinct characteristics of the MRP, the resulting interconnection is impervious to the unwinding phenomenon. The design and validation of an MRP-based controller exemplify the application of the proposed algorithm alongside the novel results for equivalence of stability between spaces. The solution renders the attitude space tracking dynamics robustly globally exponentially stable, demonstrating the potential of this novel methodology.
♻ ☆ Performance-Guaranteed Solutions for Multi-Agent Optimal Coverage Problems using Submodularity, Curvature, and Greedy Algorithms
We consider a class of multi-agent optimal coverage problems in which the goal is to determine the optimal placement of a group of agents in a given mission space so that they maximize a coverage objective that represents a blend of individual and collaborative event detection capabilities. This class of problems is extremely challenging due to the non-convex nature of the mission space and of the coverage objective. With this motivation, greedy algorithms are often used as means of getting feasible coverage solutions efficiently. Even though such greedy solutions are suboptimal, the submodularity (diminishing returns) property of the coverage objective can be exploited to provide performance bound guarantees. Moreover, we show that improved performance bound guarantees (beyond the standard (1-1/e) performance bound) can be established using various curvature measures of the coverage problem. In particular, we provide a brief review of all existing popular applicable curvature measures, including a recent curvature measure that we proposed, and discuss their effectiveness and computational complexity, in the context of optimal coverage problems. We also propose novel computationally efficient techniques to estimate some curvature measures. Finally, we provide several numerical results to support our findings and propose several potential future research directions.
comment: Will be submitted to CDC 2024
Human-Computer Interaction
☆ Introduction to Human-Robot Interaction: A Multi-Perspective Introductory Course
In this paper I describe the design of an introductory course in Human-Robot Interaction. This project-driven course is designed to introduce undergraduate and graduate engineering students, especially those enrolled in Computer Science, Mechanical Engineering, and Robotics degree programs, to key theories and methods used in the field of Human-Robot Interaction that they would otherwise be unlikely to see in those degree programs. To achieve this aim, the course takes students all the way from stakeholder analysis to empirical evaluation, covering and integrating key Qualitative, Design, Computational, and Quantitative methods along the way. I detail the goals, audience, and format of the course, and provide a detailed walkthrough of the course syllabus.
comment: Presented at the Designing an Intro to HRI Course Workshop at HRI 2024 (arXiv:2403.05588)
☆ Visual Highlighting for Situated Brushing and Linking
Brushing and linking is widely used for visual analytics in desktop environments. However, using this approach to link many data items between situated (e.g., a virtual screen with data) and embedded views (e.g., highlighted objects in the physical environment) is largely unexplored. To this end, we study the effectiveness of visual highlighting techniques in helping users identify and link physical referents to brushed data marks in a situated scatterplot. In an exploratory virtual reality user study (N=20), we evaluated four highlighting techniques under different physical layouts and tasks. We discuss the effectiveness of these techniques, as well as implications for the design of brushing and linking operations in situated analytics.
comment: To appear in EuroVis 2024
☆ Blockchain-based Pseudonym Management for Vehicle Twin Migrations in Vehicular Edge Metaverse
Driven by the great advances in metaverse and edge computing technologies, vehicular edge metaverses are expected to disrupt the current paradigm of intelligent transportation systems. As highly computerized avatars of Vehicular Metaverse Users (VMUs), the Vehicle Twins (VTs) deployed in edge servers can provide valuable metaverse services to improve driving safety and on-board satisfaction for their VMUs throughout journeys. To maintain uninterrupted metaverse experiences, VTs must be migrated among edge servers following the movements of vehicles. This can raise concerns about privacy breaches during the dynamic communications among vehicular edge metaverses. To address these concerns and safeguard location privacy, pseudonyms as temporary identifiers can be leveraged by both VMUs and VTs to realize anonymous communications in the physical space and virtual spaces. However, existing pseudonym management methods fall short in meeting the extensive pseudonym demands in vehicular edge metaverses, thus dramatically diminishing the performance of privacy preservation. To this end, we present a cross-metaverse empowered dual pseudonym management framework. We utilize cross-chain technology to enhance management efficiency and data security for pseudonyms. Furthermore, we propose a metric to assess the privacy level and employ a Multi-Agent Deep Reinforcement Learning (MADRL) approach to obtain an optimal pseudonym generating strategy. Numerical results demonstrate that our proposed schemes are high-efficiency and cost-effective, showcasing their promising applications in vehicular edge metaverses.
comment: 14 pages, 9 figures
☆ (Un)making AI Magic: a Design Taxonomy
This paper examines the role that enchantment plays in the design of AI things by constructing a taxonomy of design approaches that increase or decrease the perception of magic and enchantment. We start from the design discourse surrounding recent developments in AI technologies, highlighting specific interaction qualities such as algorithmic uncertainties and errors and articulating relations to the rhetoric of magic and supernatural thinking. Through analyzing and reflecting upon 52 students' design projects from two editions of a Master course in design and AI, we identify seven design principles and unpack the effects of each in terms of enchantment and disenchantment. We conclude by articulating ways in which this taxonomy can be approached and appropriated by design/HCI practitioners, especially to support exploration and reflexivity.
☆ Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Modern language models, while sophisticated, exhibit some inherent shortcomings, particularly in conversational settings. We claim that many of the observed shortcomings can be attributed to violation of one or more conversational principles. By drawing upon extensive research from both the social science and AI communities, we propose a set of maxims -- quantity, quality, relevance, manner, benevolence, and transparency -- for describing effective human-AI conversation. We first justify the applicability of the first four maxims (from Grice) in the context of human-AI interactions. We then argue that two new maxims, benevolence (concerning the generation of, and engagement with, harmful content) and transparency (concerning recognition of one's knowledge boundaries, operational constraints, and intents), are necessary for addressing behavior unique to modern human-AI interactions. The proposed maxims offer prescriptive guidance on how to assess conversational quality between humans and LLM-driven conversational agents, informing both their evaluation and improved design.
☆ Extracting Human Attention through Crowdsourced Patch Labeling
In image classification, a significant problem arises from bias in the datasets. When it contains only specific types of images, the classifier begins to rely on shortcuts - simplistic and erroneous rules for decision-making. This leads to high performance on the training dataset but inferior results on new, varied images, as the classifier's generalization capability is reduced. For example, if the images labeled as mustache consist solely of male figures, the model may inadvertently learn to classify images by gender rather than the presence of a mustache. One approach to mitigate such biases is to direct the model's attention toward the target object's location, usually marked using bounding boxes or polygons for annotation. However, collecting such annotations requires substantial time and human effort. Therefore, we propose a novel patch-labeling method that integrates AI assistance with crowdsourcing to capture human attention from images, which can be a viable solution for mitigating bias. Our method consists of two steps. First, we extract the approximate location of a target using a pre-trained saliency detection model supplemented by human verification for accuracy. Then, we determine the human-attentive area in the image by iteratively dividing the image into smaller patches and employing crowdsourcing to ascertain whether each patch can be classified as the target object. We demonstrated the effectiveness of our method in mitigating bias through improved classification accuracy and the refined focus of the model. Also, crowdsourced experiments validate that our method collects human annotation up to 3.4 times faster than annotating object locations with polygons, significantly reducing the need for human resources. We conclude the paper by discussing the advantages of our method in a crowdsourcing context, mainly focusing on aspects of human errors and accessibility.
comment: 21 pages, 11 figures
☆ Investigating Use Cases of AI-Powered Scene Description Applications for Blind and Low Vision People CHI2024
"Scene description" applications that describe visual content in a photo are useful daily tools for blind and low vision (BLV) people. Researchers have studied their use, but they have only explored those that leverage remote sighted assistants; little is known about applications that use AI to generate their descriptions. Thus, to investigate their use cases, we conducted a two-week diary study where 16 BLV participants used an AI-powered scene description application we designed. Through their diary entries and follow-up interviews, users shared their information goals and assessments of the visual descriptions they received. We analyzed the entries and found frequent use cases, such as identifying visual features of known objects, and surprising ones, such as avoiding contact with dangerous objects. We also found users scored the descriptions relatively low on average, 2.76 out of 5 (SD=1.49) for satisfaction and 2.43 out of 4 (SD=1.16) for trust, showing that descriptions still need significant improvements to deliver satisfying and trustworthy experiences. We discuss future opportunities for AI as it becomes a more powerful accessibility tool for BLV users.
comment: 21 pages, 18 figures, 5 tables, to appear CHI2024
☆ Augmented Reality Warnings in Roadway Work Zones: Evaluating the Effect of Modality on Worker Reaction Times
Given the aging highway infrastructure requiring extensive rebuilding and enhancements, and the consequent rise in the number of work zones, there is an urgent need to develop advanced safety systems to protect workers. While Augmented Reality (AR) holds significant potential for delivering warnings to workers, its integration into roadway work zones remains relatively unexplored. The primary objective of this study is to improve safety measures within roadway work zones by conducting an extensive analysis of how different combinations of multimodal AR warnings influence the reaction times of workers. This paper addresses this gap through a series of experiments that aim to replicate the distinctive conditions of roadway work zones, both in real-world and virtual reality environments. Our approach comprises three key components: an advanced AR system prototype, a VR simulation of AR functionality within the work zone environment, and the Wizard of Oz technique to synchronize user experiences across experiments. To assess reaction times, we leverage both the simple reaction time (SRT) technique and an innovative vision-based metric that utilizes real-time pose estimation. By conducting five experiments in controlled outdoor work zones and indoor VR settings, our study provides valuable information on how various multimodal AR warnings impact workers reaction times. Furthermore, our findings reveal the disparities in reaction times between VR simulations and real-world scenarios, thereby gauging VR's capability to mirror the dynamics of roadway work zones. Furthermore, our results substantiate the potential and reliability of vision-based reaction time measurements. These insights resonate well with those derived using the SRT technique, underscoring the viability of this approach for tangible real-world uses.
♻ ☆ LLMR: Real-time Prompting of Interactive Worlds using Large Language Models CHI 2024
We present Large Language Model for Mixed Reality (LLMR), a framework for the real-time creation and modification of interactive Mixed Reality experiences using LLMs. LLMR leverages novel strategies to tackle difficult cases where ideal training data is scarce, or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity. Our framework relies on text interaction and the Unity game engine. By incorporating techniques for scene understanding, task planning, self-debugging, and memory management, LLMR outperforms the standard GPT-4 by 4x in average error rate. We demonstrate LLMR's cross-platform interoperability with several example worlds, and evaluate it on a variety of creation and modification tasks to show that it can produce and edit diverse objects, tools, and scenes. Finally, we conducted a usability study (N=11) with a diverse set that revealed participants had positive experiences with the system and would use it again.
comment: 46 pages, 18 figures; Matching version accepted at CHI 2024
♻ ☆ "This is not a data problem": Algorithms and Power in Public Higher Education in Canada CHI '24
Algorithmic decision-making is increasingly being adopted across public higher education. The expansion of data-driven practices by post-secondary institutions has occurred in parallel with the adoption of New Public Management approaches by neoliberal administrations. In this study, we conduct a qualitative analysis of an in-depth ethnographic case study of data and algorithms in use at a public college in Ontario, Canada. We identify the data, algorithms, and outcomes in use at the college. We assess how the college's processes and relationships support those outcomes and the different stakeholders' perceptions of the college's data-driven systems. In addition, we find that the growing reliance on algorithmic decisions leads to increased student surveillance, exacerbation of existing inequities, and the automation of the faculty-student relationship. Finally, we identify a cycle of increased institutional power perpetuated by algorithmic decision-making, and driven by a push towards financial sustainability.
comment: In CHI '24 Proceedings of the CHI Conference on Human Factors in Computing Systems Honolulu, HI, USA
♻ ☆ LaMI: Large Language Models for Multi-Modal Human-Robot Interaction
This paper presents an innovative large language model (LLM)-based robotic system for enhancing multi-modal human-robot interaction (HRI). Traditional HRI systems relied on complex designs for intent estimation, reasoning, and behavior generation, which were resource-intensive. In contrast, our system empowers researchers and practitioners to regulate robot behavior through three key aspects: providing high-level linguistic guidance, creating "atomic actions" and expressions the robot can use, and offering a set of examples. Implemented on a physical robot, it demonstrates proficiency in adapting to multi-modal inputs and determining the appropriate manner of action to assist humans with its arms, following researchers' defined guidelines. Simultaneously, it coordinates the robot's lid, neck, and ear movements with speech output to produce dynamic, multi-modal expressions. This showcases the system's potential to revolutionize HRI by shifting from conventional, manual state-and-flow design methods to an intuitive, guidance-based, and example-driven approach. Supplementary material can be found at https://hri-eu.github.io/Lami/
comment: 10 pages, 6 figures
♻ ☆ HealMe: Harnessing Cognitive Reframing in Large Language Models for Psychotherapy
Large Language Models (LLMs) can play a vital role in psychotherapy by adeptly handling the crucial task of cognitive reframing and overcoming challenges such as shame, distrust, therapist skill variability, and resource scarcity. Previous LLMs in cognitive reframing mainly converted negative emotions to positive ones, but these approaches have limited efficacy, often not promoting clients' self-discovery of alternative perspectives. In this paper, we unveil the Helping and Empowering through Adaptive Language in Mental Enhancement (HealMe) model. This novel cognitive reframing therapy method effectively addresses deep-rooted negative thoughts and fosters rational, balanced perspectives. Diverging from traditional LLM methods, HealMe employs empathetic dialogue based on psychotherapeutic frameworks. It systematically guides clients through distinguishing circumstances from feelings, brainstorming alternative viewpoints, and developing empathetic, actionable suggestions. Moreover, we adopt the first comprehensive and expertly crafted psychological evaluation metrics, specifically designed to rigorously assess the performance of cognitive reframing, in both AI-simulated dialogues and real-world therapeutic conversations. Experimental results show that our model outperforms others in terms of empathy, guidance, and logical coherence, demonstrating its effectiveness and potential positive impact on psychotherapy.
comment: 17 pages, 4 figures
♻ ☆ Enhancing Worker Recruitment in Collaborative Mobile Crowdsourcing: A Graph Neural Network Trust Evaluation Approach
Collaborative Mobile Crowdsourcing (CMCS) allows platforms to recruit worker teams to collaboratively execute complex sensing tasks. The efficiency of such collaborations could be influenced by trust relationships among workers. To obtain the asymmetric trust values among all workers in the social network, the Trust Reinforcement Evaluation Framework (TREF) based on Graph Convolutional Neural Networks (GCNs) is proposed in this paper. The task completion effect is comprehensively calculated by considering the workers' ability benefits, distance benefits, and trust benefits in this paper. The worker recruitment problem is modeled as an Undirected Complete Recruitment Graph (UCRG), for which a specific Tabu Search Recruitment (TSR) algorithm solution is proposed. An optimal execution team is recruited for each task by the TSR algorithm, and the collaboration team for the task is obtained under the constraint of privacy loss. To enhance the efficiency of the recruitment algorithm on a large scale and scope, the Mini-Batch K-Means clustering algorithm and edge computing technology are introduced, enabling distributed worker recruitment. Lastly, extensive experiments conducted on five real datasets validate that the recruitment algorithm proposed in this paper outperforms other baselines. Additionally, TREF proposed herein surpasses the performance of state-of-the-art trust evaluation methods in the literature.
comment: The article has been accepted by IEEE TMC, and its DOI is 10.1109/TMC.2024.3373469
♻ ☆ AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers
For effective human-robot interaction, robots need to understand, plan, and execute complex, long-horizon tasks described by natural language. Recent advances in large language models (LLMs) have shown promise for translating natural language into robot action sequences for complex tasks. However, existing approaches either translate the natural language directly into robot trajectories or factor the inference process by decomposing language into task sub-goals and relying on a motion planner to execute each sub-goal. When complex environmental and temporal constraints are involved, inference over planning tasks must be performed jointly with motion plans using traditional task-and-motion planning (TAMP) algorithms, making factorization into subgoals untenable. Rather than using LLMs to directly plan task sub-goals, we instead perform few-shot translation from natural language task descriptions to an intermediate task representation that can then be consumed by a TAMP algorithm to jointly solve the task and motion plan. To improve translation, we automatically detect and correct both syntactic and semantic errors via autoregressive re-prompting, resulting in significant improvements in task completion. We show that our approach outperforms several methods using LLMs as planners in complex task domains. See our project website https://yongchao98.github.io/MIT-REALM-AutoTAMP/ for prompts, videos, and code.
comment: 8 pages, 4 figures
♻ ☆ User Training with Error Augmentation for Electromyogram-based Gesture Classification
We designed and tested a system for real-time control of a user interface by extracting surface electromyographic (sEMG) activity from eight electrodes in a wrist-band configuration. sEMG data were streamed into a machine-learning algorithm that classified hand gestures in real-time. After an initial model calibration, participants were presented with one of three types of feedback during a human-learning stage: veridical feedback, in which predicted probabilities from the gesture classification algorithm were displayed without alteration, modified feedback, in which we applied a hidden augmentation of error to these probabilities, and no feedback. User performance was then evaluated in a series of minigames, in which subjects were required to use eight gestures to manipulate their game avatar to complete a task. Experimental results indicated that, relative to baseline, the modified feedback condition led to significantly improved accuracy and improved gesture class separation. These findings suggest that real-time feedback in a gamified user interface with manipulation of feedback may enable intuitive, rapid, and accurate task acquisition for sEMG-based gesture recognition applications.
comment: 10 pages, 10 figures. V2: Fix latex characters in author name. V3: Add published DOI and Copyright notice
Social and Information Networks
☆ Closing the Information Gap in Unidentified Anomalous Phenomena (UAP) Studies
Unidentified Anomalous Phenomena (UAP), also known as Unidentified Flying Objects (UFOs), has shifted from being a stigmatized topic on the fringes of scientific inquiry to a legitimate subject of scientific interest with a need for high quality, curated data, and rigorous scientific investigation. This paper presents a preliminary scoping review and analysis of scholarly literature related to UAP from 1967 until 2023, exploring a diverse range of research areas across disciplines to illustrate scholarly discourse about the topic. The paper focuses on characterizing papers published in recent years and notes that Library & Information Science is unrepresented in the current UAP literature. The paper also discusses how researchers across the iFields can contribute to UAP studies through inherent expertise such as data curation and data science as well as information behavior and information literacy, among others. The paper concludes by emphasizing that UAP Studies offer a rich intellectual realm for information science research, with the iFields well positioned to play a crucial role in supporting and engaging in the study of UAP.
comment: iConference 2024
☆ Hierarchical Information Enhancement Network for Cascade Prediction in Social Networks
Understanding information cascades in networks is a fundamental issue in numerous applications. Current researches often sample cascade information into several independent paths or subgraphs to learn a simple cascade representation. However, these approaches fail to exploit the hierarchical semantic associations between different modalities, limiting their predictive performance. In this work, we propose a novel Hierarchical Information Enhancement Network (HIENet) for cascade prediction. Our approach integrates fundamental cascade sequence, user social graphs, and sub-cascade graph into a unified framework. Specifically, HIENet utilizes DeepWalk to sample cascades information into a series of sequences. It then gathers path information between users to extract the social relationships of propagators. Additionally, we employ a time-stamped graph convolutional network to aggregate sub-cascade graph information effectively. Ultimately, we introduce a Multi-modal Cascade Transformer to powerfully fuse these clues, providing a comprehensive understanding of cascading process. Extensive experiments have demonstrated the effectiveness of the proposed method.
comment: 7 pages, 2 figures
☆ Multi-perspective Memory Enhanced Network for Identifying Key Nodes in Social Networks
Identifying key nodes in social networks plays a crucial role in timely blocking false information. Existing key node identification methods usually consider node influence only from the propagation structure perspective and have insufficient generalization ability to unknown scenarios. In this paper, we propose a novel Multi-perspective Memory Enhanced Network (MMEN) for identifying key nodes in social networks, which mines key nodes from multiple perspectives and utilizes memory networks to store historical information. Specifically, MMEN first constructs two propagation networks from the perspectives of user attributes and propagation structure and updates node feature representations using graph attention networks. Meanwhile, the memory network is employed to store information of similar subgraphs, enhancing the model's generalization performance in unknown scenarios. Finally, MMEN applies adaptive weights to combine the node influence of the two propagation networks to select the ultimate key nodes. Extensive experiments demonstrate that our method significantly outperforms previous methods.
comment: 7 pages, 1 figures
☆ InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection AAAI
Large Language Models (LLMs) raise concerns about lowering the cost of generating texts that could be used for unethical or illegal purposes, especially on social media. This paper investigates the promise of such models to help enforce legal requirements related to the disclosure of sponsored content online. We investigate the use of LLMs for generating synthetic Instagram captions with two objectives: The first objective (fidelity) is to produce realistic synthetic datasets. For this, we implement content-level and network-level metrics to assess whether synthetic captions are realistic. The second objective (utility) is to create synthetic data that is useful for sponsored content detection. For this, we evaluate the effectiveness of the generated synthetic data for training classifiers to identify undisclosed advertisements on Instagram. Our investigations show that the objectives of fidelity and utility may conflict and that prompt engineering is a useful but insufficient strategy. Additionally, we find that while individual synthetic posts may appear realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.
comment: To appear at the 18th International AAAI Conference on Web and Social Media (ICWSM 2024) -- please cite accordingly
☆ LLM-Driven Agents for Influencer Selection in Digital Advertising Campaigns
In the digital world, influencers are pivotal as opinion leaders, shaping the views and choices of their influencees. Modern advertising often follows this trend, where marketers choose appropriate influencers for product endorsements, based on thorough market analysis. Previous studies on influencer selection have typically relied on numerical representations of individual opinions and interactions, a method that simplifies the intricacies of social dynamics. With the development of large language models (LLMs), we now have the opportunity to capture the nuanced exchanges of information within social networks. Hence, in this work, we first introduce an Influencer Dynamics Simulator (IDS), helping promoters identify and select the right influencers to market their products, based on LLM simulation. Concretely, we first propose an influencer-influencee engagement-based pre-selection module to screen potential influencer candidates. Subsequently, a simulation is constructed for these candidates and their influencees. Each user is represented as an LLM-based agent, drawing from their interaction history to deduce their profile and interests. The influencee agents will predict their behavior in response to influencer advertising. Finally, we develop a ranking metric designed to pinpoint influencers who are most likely to drive product purchases based on feedback from their influencees. To evaluate our framework, we collect a real-world advertising network dataset, including social relations, post and comment content, and user behaviors. Our dataset covers six products in diverse categories with their promoter influencers. Experiments show that IDS accurately identifies influencers from hundreds of candidates and provides valuable insights through detailed comments and specific attitudes.
☆ Ellipsoidal embeddings of graphs
Due to their flexibility to represent almost any kind of relational data, graph-based models have enjoyed a tremendous success over the past decades. While graphs are inherently only combinatorial objects, however, many prominent analysis tools are based on the algebraic representation of graphs via matrices such as the graph Laplacian, or on associated graph embeddings. Such embeddings associate to each node a set of coordinates in a vector space, a representation which can then be employed for learning tasks such as the classification or alignment of the nodes of the graph. As the geometric picture provided by embedding methods enables the use of a multitude of methods developed for vector space data, embeddings have thus gained interest both from a theoretical as well as a practical perspective. Inspired by trace-optimization problems, often encountered in the analysis of graph-based data, here we present a method to derive ellipsoidal embeddings of the nodes of a graph, in which each node is assigned a set of coordinates on the surface of a hyperellipsoid. Our method may be seen as an alternative to popular spectral embedding techniques, to which it shares certain similarities we discuss. To illustrate the utility of the embedding we conduct a case study in which analyse synthetic and real world networks with modular structure, and compare the results obtained with known methods in the literature.
comment: 29 pages, 6 figures
☆ Reconstructing the evolution history of networked complex systems
The evolution processes of complex systems carry key information in the systems' functional properties. Applying machine learning algorithms, we demonstrate that the historical formation process of various networked complex systems can be extracted, including protein-protein interaction, ecology, and social network systems. The recovered evolution process has demonstrations of immense scientific values, such as interpreting the evolution of protein-protein interaction network, facilitating structure prediction, and particularly revealing the key co-evolution features of network structures such as preferential attachment, community structure, local clustering, degree-degree correlation that could not be explained collectively by previous theories. Intriguingly, we discover that for large networks, if the performance of the machine learning model is slightly better than a random guess on the pairwise order of links, reliable restoration of the overall network formation process can be achieved. This suggests that evolution history restoration is generally highly feasible on empirical networks.
☆ Simple Graph Condensation
The burdensome training costs on large-scale graphs have aroused significant interest in graph condensation, which involves tuning Graph Neural Networks (GNNs) on a small condensed graph for use on the large-scale original graph. Existing methods primarily focus on aligning key metrics between the condensed and original graphs, such as gradients, distribution and trajectory of GNNs, yielding satisfactory performance on downstream tasks. However, these complex metrics necessitate intricate computations and can potentially disrupt the optimization process of the condensation graph, making the condensation process highly demanding and unstable. Motivated by the recent success of simplified models in various fields, we propose a simplified approach to metric alignment in graph condensation, aiming to reduce unnecessary complexity inherited from GNNs. In our approach, we eliminate external parameters and exclusively retain the target condensed graph during the condensation process. Following the hierarchical aggregation principles of GNNs, we introduce the Simple Graph Condensation (SimGC) framework, which aligns the condensed graph with the original graph from the input layer to the prediction layer, guided by a pre-trained Simple Graph Convolution (SGC) model on the original graph. As a result, both graphs possess the similar capability to train GNNs. This straightforward yet effective strategy achieves a significant speedup of up to 10 times compared to existing graph condensation methods while performing on par with state-of-the-art baselines. Comprehensive experiments conducted on seven benchmark datasets demonstrate the effectiveness of SimGC in prediction accuracy, condensation time, and generalization capability. Our code will be made publicly available.
comment: Under review
☆ Unraveling Contagion Origins: Optimal Estimation through Maximum-Likelihood and Starlike Tree Approximation in Markovian Spreading Models
Identifying the source of epidemic-like spread in networks is crucial for tasks like removing internet viruses or finding the rumor source in online social networks. The challenge lies in tracing the source from a snapshot observation of infected nodes. How do we accurately pinpoint the source? Utilizing snapshot data, we apply a probabilistic approach, focusing on the graph boundary and the observed time, to detect sources via an effective maximum likelihood algorithm. A novel starlike tree approximation extends applicability to general graphs, demonstrating versatility. We highlight the utility of the Gamma function for analyzing the asymptotic behavior of the likelihood ratio between nodes. Comprehensive evaluations confirm algorithmic effectiveness in diverse network scenarios, advancing rumor source detection in large-scale network analysis and information dissemination strategies.
☆ Nonparametric inference of higher order interaction patterns in networks
We propose a method for obtaining parsimonious decompositions of networks into higher order interactions which can take the form of arbitrary motifs.The method is based on a class of analytically solvable generative models, where vertices are connected via explicit copies of motifs, which in combination with non-parametric priors allow us to infer higher order interactions from dyadic graph data without any prior knowledge on the types or frequencies of such interactions. Crucially, we also consider 'degree--corrected' models that correctly reflect the degree distribution of the network and consequently prove to be a better fit for many real world--networks compared to non-degree corrected models. We test the presented approach on simulated data for which we recover the set of underlying higher order interactions to a high degree of accuracy. For empirical networks the method identifies concise sets of atomic subgraphs from within thousands of candidates that cover a large fraction of edges and include higher order interactions of known structural and functional significance. The method not only produces an explicit higher order representation of the network but also a fit of the network to analytically tractable models opening new avenues for the systematic study of higher order network structures.
comment: 18 pages, 3 figures
♻ ☆ Identifying hubs in directed networks
Nodes in networks that exhibit high connectivity, also called ``hubs'', play a critical role in determining the structural and functional properties of networked systems. However, there is no clear definition of what constitutes a hub node in a network, and the classification of network hubs in existing work has either been purely qualitative or relies on ad hoc criteria for thresholding continuous data that do not generalize well to networks with certain degree sequences. Here we develop a set of efficient nonparametric methods that classify hub nodes in directed networks using the Minimum Description Length principle, effectively providing a clear and principled definition for network hubs. We adapt our methods to both unweighted and weighted networks and demonstrate them in a range of example applications using real and synthetic network data.
♻ ☆ Enhancing Worker Recruitment in Collaborative Mobile Crowdsourcing: A Graph Neural Network Trust Evaluation Approach
Collaborative Mobile Crowdsourcing (CMCS) allows platforms to recruit worker teams to collaboratively execute complex sensing tasks. The efficiency of such collaborations could be influenced by trust relationships among workers. To obtain the asymmetric trust values among all workers in the social network, the Trust Reinforcement Evaluation Framework (TREF) based on Graph Convolutional Neural Networks (GCNs) is proposed in this paper. The task completion effect is comprehensively calculated by considering the workers' ability benefits, distance benefits, and trust benefits in this paper. The worker recruitment problem is modeled as an Undirected Complete Recruitment Graph (UCRG), for which a specific Tabu Search Recruitment (TSR) algorithm solution is proposed. An optimal execution team is recruited for each task by the TSR algorithm, and the collaboration team for the task is obtained under the constraint of privacy loss. To enhance the efficiency of the recruitment algorithm on a large scale and scope, the Mini-Batch K-Means clustering algorithm and edge computing technology are introduced, enabling distributed worker recruitment. Lastly, extensive experiments conducted on five real datasets validate that the recruitment algorithm proposed in this paper outperforms other baselines. Additionally, TREF proposed herein surpasses the performance of state-of-the-art trust evaluation methods in the literature.
comment: The article has been accepted by IEEE TMC, and its DOI is 10.1109/TMC.2024.3373469
♻ ☆ Multiscale Hodge Scattering Networks for Data Analysis
We propose new scattering networks for signals measured on simplicial complexes, which we call \emph{Multiscale Hodge Scattering Networks} (MHSNs). Our construction is based on multiscale basis dictionaries on simplicial complexes, i.e., the $\kappa$-GHWT and $\kappa$-HGLET, which we recently developed for simplices of dimension $\kappa \in \mathbb{N}$ in a given simplicial complex by generalizing the node-based Generalized Haar-Walsh Transform (GHWT) and Hierarchical Graph Laplacian Eigen Transform (HGLET). The $\kappa$-GHWT and the $\kappa$-HGLET both form redundant sets (i.e., dictionaries) of multiscale basis vectors and the corresponding expansion coefficients of a given signal. Our MHSNs use a layered structure analogous to a convolutional neural network (CNN) to cascade the moments of the modulus of the dictionary coefficients. The resulting features are invariant to reordering of the simplices (i.e., node permutation of the underlying graphs). Importantly, the use of multiscale basis dictionaries in our MHSNs admits a natural pooling operation that is akin to local pooling in CNNs, and which may be performed either locally or per-scale. These pooling operations are harder to define in both traditional scattering networks based on Morlet wavelets, and geometric scattering networks based on Diffusion Wavelets. As a result, we are able to extract a rich set of descriptive yet robust features that can be used along with very simple machine learning methods (i.e., logistic regression or support vector machines) to achieve high-accuracy classification systems with far fewer parameters to train than most modern graph neural networks. Finally, we demonstrate the usefulness of our MHSNs in three distinct types of problems: signal classification, domain (i.e., graph/simplex) classification, and molecular dynamics prediction.
comment: 20 Pages, Comments Welcome
♻ ☆ The latent cognitive structures of social networks
When people are asked to recall their social networks, theoretical and empirical work tells us that they rely on shortcuts, or heuristics. Cognitive Social Structures (CSS) are multilayer social networks where each layer corresponds to an individual's perception of the network. With multiple perceptions of the same network, CSSs contain rich information about how these heuristics manifest, motivating the question, Can we identify people who share the same heuristics? In this work, we propose a method for identifying cognitive structure across multiple network perceptions, analogous to how community detection aims to identify social structure in a network. To simultaneously model the joint latent social and cognitive structure, we study CSSs as three-dimensional tensors, employing low-rank nonnegative Tucker decompositions (NNTuck) to approximate the CSS--a procedure closely related to estimating a multilayer stochastic block model (SBM) from such data. We propose the resulting latent cognitive space as an operationalization of the sociological theory of social cognition by identifying individuals who share relational schema. In addition to modeling cognitively independent, dependent, and redundant networks, we propose a specific model instance and related statistical test for testing when there is social-cognitive agreement in a network: when the social and cognitive structures are equivalent. We use our approach to analyze four different CSSs and give insights into the latent cognitive structures of those networks.
Distributed, Parallel, and Cluster Computing
☆ Fourier Transform-based Estimators for Data Sketches
In this paper we consider the problem of estimating the $f$-moment ($\sum_{v\in [n]} (f(\mathbf{x}(v))-f(0))$) of a dynamic vector $\mathbf{x}\in \mathbb{G}^n$ over some abelian group $(\mathbb{G},+)$, where the $\|f\|_\infty$ norm is bounded. We propose a simple sketch and new estimation framework based on the \emph{Fourier transform} of $f$. I.e., we decompose $f$ into a linear combination of homomorphisms $f_1,f_2,\ldots$ from $(\mathbb{G},+)$ to $(\mathbb{C},\times)$, estimate the $f_k$-moment for each $f_k$, and synthesize them to obtain an estimate of the $f$-moment. Our estimators are asymptotically unbiased and have variance asymptotic to $\|\mathbf{x}\|_0^2 (\|f\|_\infty^2 m^{-1} + \|\hat{f}\|_1^2 m^{-2})$, where the size of the sketch is $O(m\log n\log|\mathbb{G}|)$ bits. When $\mathbb{G}=\mathbb{Z}$ this problem can also be solved using off-the-shelf $\ell_0$-samplers with space $O(m\log^2 n)$ bits, which does not obviously generalize to finite groups. As a concrete benchmark, we extend Ganguly, Garofalakis, and Rastogi's singleton-detector-based sampler to work over $\mathbb{G}$ using $O(m\log n\log|\mathbb{G}|\log(m\log n))$ bits. We give some experimental evidence that the Fourier-based estimation framework is significantly more accurate than sampling-based approaches at the same memory footprint.
☆ ProvDeploy: Provenance-oriented Containerization of High Performance Computing Scientific Workflows
Many existing scientific workflows require High Performance Computing environments to produce results in a timely manner. These workflows have several software library components and use different environments, making the deployment and execution of the software stack not trivial. This complexity increases if the user needs to add provenance data capture services to the workflow. This manuscript introduces ProvDeploy to assist the user in configuring containers for scientific workflows with integrated provenance data capture. ProvDeploy was evaluated with a Scientific Machine Learning workflow, exploring containerization strategies focused on provenance in two distinct HPC environments
☆ FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication ICDE
Serverless computing offers attractive scalability, elasticity and cost-effectiveness. However, constraints on memory, CPU and function runtime have hindered its adoption for data-intensive applications and machine learning (ML) workloads. Traditional 'server-ful' platforms enable distributed computation via fast networks and well-established inter-process communication (IPC) mechanisms such as MPI and shared memory. In the absence of such solutions in the serverless domain, parallel computation with significant IPC requirements is challenging. We present FSD-Inference, the first fully serverless and highly scalable system for distributed ML inference. We explore potential communication channels, in conjunction with Function-as-a-Service (FaaS) compute, to design a state-of-the-art solution for distributed ML within the context of serverless data-intensive computing. We introduce novel fully serverless communication schemes for ML inference workloads, leveraging both cloud-based publish-subscribe/queueing and object storage offerings. We demonstrate how publish-subscribe/queueing services can be adapted for FaaS IPC with comparable performance to object storage, while offering significantly reduced cost at high parallelism levels. We conduct in-depth experiments on benchmark DNNs of various sizes. The results show that when compared to server-based alternatives, FSD-Inference is significantly more cost-effective and scalable, and can even achieve competitive performance against optimized HPC solutions. Experiments also confirm that our serverless solution can handle large distributed workloads and leverage high degrees of FaaS parallelism.
comment: In Proceedings of 2024 IEEE 40th International Conference on Data Engineering (ICDE) (to appear)
☆ ECHO: Efficient Off-Chain Payments and Cross-Chain Swaps for Cryptocurrencies
In this paper, we present ECHO, a TEE-based layer-2 solution that tackles two crucial challenges in the realm of cryptocurrencies: off-chain payments and cross-chain swaps. It offers three notable features: - Channel-free off-chain payments: it allows a payer to make direct payments to anyone without requiring any on-chain relationship or intermediary channels. - Real-time yet decentralized cross-chain swaps: it is the first known solution that enables real-time cross-chain swaps without relying on a central server. This novel feature is made possible through a ground-breaking fair exchange protocol. - TEE crash-tolerance: it offers two solutions to handle TEE crashes, one of which involves an innovative application of time-lock puzzles in this context. We evaluate ECHO on a network consists of 1000 nodes and the evaluation results show that ECHO can achieve 7000 TPS
☆ Modeling Distributed Computing Infrastructures for HEP Applications
Predicting the performance of various infrastructure design options in complex federated infrastructures with computing sites distributed over a wide area network that support a plethora of users and workflows, such as the Worldwide LHC Computing Grid (WLCG), is not trivial. Due to the complexity and size of these infrastructures, it is not feasible to deploy experimental test-beds at large scales merely for the purpose of comparing and evaluating alternate designs. An alternative is to study the behaviours of these systems using simulation. This approach has been used successfully in the past to identify efficient and practical infrastructure designs for High Energy Physics (HEP). A prominent example is the Monarc simulation framework, which was used to study the initial structure of the WLCG. New simulation capabilities are needed to simulate large-scale heterogeneous computing systems with complex networks, data access and caching patterns. A modern tool to simulate HEP workloads that execute on distributed computing infrastructures based on the SimGrid and WRENCH simulation frameworks is outlined. Studies of its accuracy and scalability are presented using HEP as a case-study. Hypothetical adjustments to prevailing computing architectures in HEP are studied providing insights into the dynamics of a part of the WLCG and candidates for improvements.
comment: To appear in Proceedings of the 26th International Conference on Computing in High Energy & Nuclear Physics (CHEP2023)
☆ Fast real-time arbitrary waveform generation using graphic processing units
Real-time Arbitrary Waveform Generation (AWG) is essential in various engineering and research applications, and often requires complex bespoke hardware and software. This paper introduces an AWG framework using an NVIDIA Graphics Processing Unit (GPU) and a commercially available high-speed Digital-to-Analog Converter (DAC) card, both running on a desktop personal computer (PC). The GPU accelerates the "embarrassingly" data parallel additive waveform synthesis framework for AWG, and the DAC reconstructs the generated waveform in the analog domain at high speed. The AWG framework is programmed using the developer-friendly Compute Unified Device Architecture (CUDA) runtime application programming interface from NVIDIA and is readily customizable, and scalable with additional parallel hardware. We present and characterize two different pathways for computing modulated radio-frequency (rf) waveforms: one pathway offers high-complexity simultaneous chirping of 1000 individual Nyquist-limited single-frequency tones for 35 ms at a sampling rate of 560 MB/s, and the other pathway allows simultaneous continuous chirping of 194 individual Nyquist-limited single-frequency tones at 100 MB/s, or 20 individual tones at 560 MB/s. This AWG framework is designed for fast on-the-fly rearrangement of a large stochastically-loaded optical tweezer array of single atoms or molecules into a defect-free array needed for quantum simulation and quantum computation applications.
comment: 13 pages, 10 figures
♻ ☆ Sparse Mean Field Load Balancing in Large Localized Queueing Systems
Scalable load balancing algorithms are of great interest in cloud networks and data centers, necessitating the use of tractable techniques to compute optimal load balancing policies for good performance. However, most existing scalable techniques, especially asymptotically scaling methods based on mean field theory, have not been able to model large queueing networks with strong locality. Meanwhile, general multi-agent reinforcement learning techniques can be hard to scale and usually lack a theoretical foundation. In this work, we address this challenge by leveraging recent advances in sparse mean field theory to learn a near-optimal load balancing policy in sparsely connected queueing networks in a tractable manner, which may be preferable to global approaches in terms of wireless communication overhead. Importantly, we obtain a general load balancing framework for a large class of sparse bounded-degree wireless topologies. By formulating a novel mean field control problem in the context of graphs with bounded degree, we reduce the otherwise difficult multi-agent problem to a single-agent problem. Theoretically, the approach is justified by approximation guarantees. Empirically, the proposed methodology performs well on several realistic and scalable wireless network topologies as compared to a number of well-known load balancing heuristics and existing scalable multi-agent reinforcement learning methods.
♻ ☆ No distributed quantum advantage for approximate graph coloring STOC 2024
We give an almost complete characterization of the hardness of $c$-coloring $\chi$-chromatic graphs with distributed algorithms, for a wide range of models of distributed computing. In particular, we show that these problems do not admit any distributed quantum advantage. To do that: 1) We give a new distributed algorithm that finds a $c$-coloring in $\chi$-chromatic graphs in $\tilde{\mathcal{O}}(n^{\frac{1}{\alpha}})$ rounds, with $\alpha = \bigl\lfloor\frac{c-1}{\chi - 1}\bigr\rfloor$. 2) We prove that any distributed algorithm for this problem requires $\Omega(n^{\frac{1}{\alpha}})$ rounds. Our upper bound holds in the classical, deterministic LOCAL model, while the near-matching lower bound holds in the non-signaling model. This model, introduced by Arfaoui and Fraigniaud in 2014, captures all models of distributed graph algorithms that obey physical causality; this includes not only classical deterministic LOCAL and randomized LOCAL but also quantum-LOCAL, even with a pre-shared quantum state. We also show that similar arguments can be used to prove that, e.g., 3-coloring 2-dimensional grids or $c$-coloring trees remain hard problems even for the non-signaling model, and in particular do not admit any quantum advantage. Our lower-bound arguments are purely graph-theoretic at heart; no background on quantum information theory is needed to establish the proofs.
comment: Accepted to STOC 2024
♻ ☆ LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers
While polyhedral compilers have shown success in implementing advanced code transformations, they still have challenges in selecting the most profitable transformations that lead to the best speedups. This has motivated the use of machine learning to build cost models to guide the search for polyhedral optimizations. State-of-the-art polyhedral compilers have demonstrated a viable proof-of-concept of this approach. While such a proof-of-concept has shown promise, it still has significant limitations. State-of-the-art polyhedral compilers that use a deep-learning cost model only support a small subset of affine transformations, limiting their ability to apply complex code transformations. They also only support simple programs that have a single loop nest and a rectangular iteration domain, limiting their applicability to many programs. These limitations significantly impact the generality of such compilers and autoschedulers and put into question the whole approach. In this paper, we introduce LOOPer, the first polyhedral autoscheduler that uses a deep-learning based cost model and covers a large set of affine transformations and programs. It supports the exploration of a large set of affine transformations, allowing the application of complex sequences of polyhedral transformations. It also supports the optimization of programs with multiple loop nests and with rectangular and non-rectangular iteration domains, allowing the optimization of an extensive set of programs. We implement and evaluate LOOPer and show that it achieves speedups over the state-of-the-art. On the Polybench benchmark, LOOPer achieves a geometric mean speedup of 1.59x over Tiramisu. LOOPer also achieves competitive speedups with a geometric mean speedup of 1.34x over Pluto, a state-of-the-art polyhedral compiler that does not use a machine-learning based cost model.
Machine Learning
☆ DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data CVPR 2024
Recently, there has been an increased interest in the practical problem of learning multiple dense scene understanding tasks from partially annotated data, where each training sample is only labeled for a subset of the tasks. The missing of task labels in training leads to low-quality and noisy predictions, as can be observed from state-of-the-art methods. To tackle this issue, we reformulate the partially-labeled multi-task dense prediction as a pixel-level denoising problem, and propose a novel multi-task denoising diffusion framework coined as DiffusionMTL. It designs a joint diffusion and denoising paradigm to model a potential noisy distribution in the task prediction or feature maps and generate rectified outputs for different tasks. To exploit multi-task consistency in denoising, we further introduce a Multi-Task Conditioning strategy, which can implicitly utilize the complementary nature of the tasks to help learn the unlabeled tasks, leading to an improvement in the denoising performance of the different tasks. Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps, and outperform the state-of-the-art methods on three challenging multi-task benchmarks, under two different partial-labeling evaluation settings. The code is available at https://prismformore.github.io/diffusionmtl/.
comment: The paper is accepted by CVPR 2024
☆ LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis
Recent text-to-3D generation approaches produce impressive 3D results but require time-consuming optimization that can take up to an hour per prompt. Amortized methods like ATT3D optimize multiple prompts simultaneously to improve efficiency, enabling fast text-to-3D synthesis. However, they cannot capture high-frequency geometry and texture details and struggle to scale to large prompt sets, so they generalize poorly. We introduce LATTE3D, addressing these limitations to achieve fast, high-quality generation on a significantly larger prompt set. Key to our method is 1) building a scalable architecture and 2) leveraging 3D data during optimization through 3D-aware diffusion priors, shape regularization, and model initialization to achieve robustness to diverse and complex training prompts. LATTE3D amortizes both neural field and textured surface generation to produce highly detailed textured meshes in a single forward pass. LATTE3D generates 3D objects in 400ms, and can be further enhanced with fast test-time optimization.
comment: See the project website at https://research.nvidia.com/labs/toronto-ai/LATTE3D/
☆ Can large language models explore in-context?
We investigate the extent to which contemporary Large Language Models (LLMs) can engage in exploration, a core capability in reinforcement learning and decision making. We focus on native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt. We experiment with GPT-3.5, GPT-4, and Llama2, using a variety of prompt designs, and find that the models do not robustly engage in exploration without substantial interventions: i) Across all of our experiments, only one configuration resulted in satisfactory exploratory behavior: GPT-4 with chain-of-thought reasoning and an externally summarized interaction history, presented as sufficient statistics; ii) All other configurations did not result in robust exploratory behavior, including those with chain-of-thought reasoning but unsummarized history. Although these findings can be interpreted positively, they suggest that external summarization -- which may not be possible in more complex settings -- is important for obtaining desirable behavior from LLM agents. We conclude that non-trivial algorithmic interventions, such as fine-tuning or dataset curation, may be required to empower LLM-based decision making agents in complex settings.
☆ Augmented Reality based Simulated Data (ARSim) with multi-view consistency for AV perception networks
Detecting a diverse range of objects under various driving scenarios is essential for the effectiveness of autonomous driving systems. However, the real-world data collected often lacks the necessary diversity presenting a long-tail distribution. Although synthetic data has been utilized to overcome this issue by generating virtual scenes, it faces hurdles such as a significant domain gap and the substantial efforts required from 3D artists to create realistic environments. To overcome these challenges, we present ARSim, a fully automated, comprehensive, modular framework designed to enhance real multi-view image data with 3D synthetic objects of interest. The proposed method integrates domain adaptation and randomization strategies to address covariate shift between real and simulated data by inferring essential domain attributes from real data and employing simulation-based randomization for other attributes. We construct a simplified virtual scene using real data and strategically place 3D synthetic assets within it. Illumination is achieved by estimating light distribution from multiple images capturing the surroundings of the vehicle. Camera parameters from real data are employed to render synthetic assets in each frame. The resulting augmented multi-view consistent dataset is used to train a multi-camera perception network for autonomous vehicles. Experimental results on various AV perception tasks demonstrate the superior performance of networks trained on the augmented dataset.
comment: 17 pages, 15 figures, 7 tables
☆ A Transfer Attack to Image Watermarks
Watermark has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detector against evasion attacks in the white-box and black-box settings is well understood in the literature. However, the robustness in the no-box setting is much less understood. In particular, multiple studies claimed that image watermark is robust in such setting. In this work, we propose a new transfer evasion attack to image watermark in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show that, both theoretically and empirically, watermark-based AI-generated image detector is not robust to evasion attacks even if the attacker does not have access to the watermarking model nor the detection API.
☆ Cascading Blackout Severity Prediction with Statistically-Augmented Graph Neural Networks SC
Higher variability in grid conditions, resulting from growing renewable penetration and increased incidence of extreme weather events, has increased the difficulty of screening for scenarios that may lead to catastrophic cascading failures. Traditional power-flow-based tools for assessing cascading blackout risk are too slow to properly explore the space of possible failures and load/generation patterns. We add to the growing literature of faster graph-neural-network (GNN)-based techniques, developing two novel techniques for the estimation of blackout magnitude from initial grid conditions. First we propose several methods for employing an initial classification step to filter out safe "non blackout" scenarios prior to magnitude estimation. Second, using insights from the statistical properties of cascading blackouts, we propose a method for facilitating non-local message passing in our GNN models. We validate these two approaches on a large simulated dataset, and show the potential of both to increase blackout size estimation performance.
comment: Accepted to Power Systems Computation Conference (PSCC) 2024
☆ Learning Topological Representations for Deep Image Understanding
In many scenarios, especially biomedical applications, the correct delineation of complex fine-scaled structures such as neurons, tissues, and vessels is critical for downstream analysis. Despite the strong predictive power of deep learning methods, they do not provide a satisfactory representation of these structures, thus creating significant barriers in scalable annotation and downstream analysis. In this dissertation, we tackle such challenges by proposing novel representations of these topological structures in a deep learning framework. We leverage the mathematical tools from topological data analysis, i.e., persistent homology and discrete Morse theory, to develop principled methods for better segmentation and uncertainty estimation, which will become powerful tools for scalable annotation.
comment: Ph.D. thesis from Stony Brook University. This thesis includes works arXiv:1906.05404, arXiv:2110.08335, arXiv:2112.07812, arXiv:2103.09992, arXiv:2206.01742
☆ SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series
Transformers have widely adopted attention networks for sequence mixing and MLPs for channel mixing, playing a pivotal role in achieving breakthroughs across domains. However, recent literature highlights issues with attention networks, including low inductive bias and quadratic complexity concerning input sequence length. State Space Models (SSMs) like S4 and others (Hippo, Global Convolutions, liquid S4, LRU, Mega, and Mamba), have emerged to address the above issues to help handle longer sequence lengths. Mamba, while being the state-of-the-art SSM, has a stability issue when scaled to large networks for computer vision datasets. We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations and uses the Mamba block for sequence modeling. Extensive performance studies across image and time-series benchmarks demonstrate that SiMBA outperforms existing SSMs, bridging the performance gap with state-of-the-art transformers. Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet and transfer learning benchmarks such as Stanford Car and Flower as well as task learning benchmarks as well as seven time series benchmark datasets. The project page is available on this website ~\url{https://github.com/badripatro/Simba}.
☆ Ultrasound Imaging based on the Variance of a Diffusion Restoration Model
Despite today's prevalence of ultrasound imaging in medicine, ultrasound signal-to-noise ratio is still affected by several sources of noise and artefacts. Moreover, enhancing ultrasound image quality involves balancing concurrent factors like contrast, resolution, and speckle preservation. Recently, there has been progress in both model-based and learning-based approaches addressing the problem of ultrasound image reconstruction. Bringing the best from both worlds, we propose a hybrid reconstruction method combining an ultrasound linear direct model with a learning-based prior coming from a generative Denoising Diffusion model. More specifically, we rely on the unsupervised fine-tuning of a pre-trained Denoising Diffusion Restoration Model (DDRM). Given the nature of multiplicative noise inherent to ultrasound, this paper proposes an empirical model to characterize the stochasticity of diffusion reconstruction of ultrasound images, and shows the interest of its variance as an echogenicity map estimator. We conduct experiments on synthetic, in-vitro, and in-vivo data, demonstrating the efficacy of our variance imaging approach in achieving high-quality image reconstructions from single plane-wave acquisitions and in comparison to state-of-the-art methods.
comment: 5 pages; submitted to EUSIPCO 2024. arXiv admin note: text overlap with arXiv:2310.20618
☆ A Wasserstein perspective of Vanilla GANs
The empirical success of Generative Adversarial Networks (GANs) caused an increasing interest in theoretical research. The statistical literature is mainly focused on Wasserstein GANs and generalizations thereof, which especially allow for good dimension reduction properties. Statistical results for Vanilla GANs, the original optimization problem, are still rather limited and require assumptions such as smooth activation functions and equal dimensions of the latent space and the ambient space. To bridge this gap, we draw a connection from Vanilla GANs to the Wasserstein distance. By doing so, existing results for Wasserstein GANs can be extended to Vanilla GANs. In particular, we obtain an oracle inequality for Vanilla GANs in Wasserstein distance. The assumptions of this oracle inequality are designed to be satisfied by network architectures commonly used in practice, such as feedforward ReLU networks. By providing a quantitative result for the approximation of a Lipschitz function by a feedforward ReLU network with bounded H\"older norm, we conclude a rate of convergence for Vanilla GANs as well as Wasserstein GANs as estimators of the unknown probability distribution.
☆ Controlled Training Data Generation with Diffusion Models
In this work, we present a method to control a text-to-image generative model to produce training data specifically "useful" for supervised learning. Unlike previous works that employ an open-loop approach and pre-define prompts to generate new data using either a language model or human expertise, we develop an automated closed-loop system which involves two feedback mechanisms. The first mechanism uses feedback from a given supervised model and finds adversarial prompts that result in image generations that maximize the model loss. While these adversarial prompts result in diverse data informed by the model, they are not informed of the target distribution, which can be inefficient. Therefore, we introduce the second feedback mechanism that guides the generation process towards a certain target distribution. We call the method combining these two mechanisms Guided Adversarial Prompts. We perform our evaluations on different tasks, datasets and architectures, with different types of distribution shifts (spuriously correlated data, unseen domains) and demonstrate the efficiency of the proposed feedback mechanisms compared to open-loop approaches.
comment: Project page at https://adversarial-prompts.epfl.ch/
☆ KTbench: A Novel Data Leakage-Free Framework for Knowledge Tracing
Knowledge Tracing (KT) is concerned with predicting students' future performance on learning items in intelligent tutoring systems. Learning items are tagged with skill labels called knowledge concepts (KCs). Many KT models expand the sequence of item-student interactions into KC-student interactions by replacing learning items with their constituting KCs. This often results in a longer sequence length. This approach addresses the issue of sparse item-student interactions and minimises model parameters. However, two problems have been identified with such models. The first problem is the model's ability to learn correlations between KCs belonging to the same item, which can result in the leakage of ground truth labels and hinder performance. This problem can lead to a significant decrease in performance on datasets with a higher number of KCs per item. The second problem is that the available benchmark implementations ignore accounting for changes in sequence length when expanding KCs, leading to different models being tested with varying sequence lengths but still compared against the same benchmark. To address these problems, we introduce a general masking framework that mitigates the first problem and enhances the performance of such KT models while preserving the original model architecture without significant alterations. Additionally, we introduce KTbench, an open-source benchmark library designed to ensure the reproducibility of this work while mitigating the second problem.
comment: preprint
☆ Planning with a Learned Policy Basis to Optimally Solve Complex Tasks
Conventional reinforcement learning (RL) methods can successfully solve a wide range of sequential decision problems. However, learning policies that can generalize predictably across multiple tasks in a setting with non-Markovian reward specifications is a challenging problem. We propose to use successor features to learn a policy basis so that each (sub)policy in it solves a well-defined subproblem. In a task described by a finite state automaton (FSA) that involves the same set of subproblems, the combination of these (sub)policies can then be used to generate an optimal solution without additional learning. In contrast to other methods that combine (sub)policies via planning, our method asymptotically attains global optimality, even in stochastic environments.
☆ Blockchain-based Pseudonym Management for Vehicle Twin Migrations in Vehicular Edge Metaverse
Driven by the great advances in metaverse and edge computing technologies, vehicular edge metaverses are expected to disrupt the current paradigm of intelligent transportation systems. As highly computerized avatars of Vehicular Metaverse Users (VMUs), the Vehicle Twins (VTs) deployed in edge servers can provide valuable metaverse services to improve driving safety and on-board satisfaction for their VMUs throughout journeys. To maintain uninterrupted metaverse experiences, VTs must be migrated among edge servers following the movements of vehicles. This can raise concerns about privacy breaches during the dynamic communications among vehicular edge metaverses. To address these concerns and safeguard location privacy, pseudonyms as temporary identifiers can be leveraged by both VMUs and VTs to realize anonymous communications in the physical space and virtual spaces. However, existing pseudonym management methods fall short in meeting the extensive pseudonym demands in vehicular edge metaverses, thus dramatically diminishing the performance of privacy preservation. To this end, we present a cross-metaverse empowered dual pseudonym management framework. We utilize cross-chain technology to enhance management efficiency and data security for pseudonyms. Furthermore, we propose a metric to assess the privacy level and employ a Multi-Agent Deep Reinforcement Learning (MADRL) approach to obtain an optimal pseudonym generating strategy. Numerical results demonstrate that our proposed schemes are high-efficiency and cost-effective, showcasing their promising applications in vehicular edge metaverses.
comment: 14 pages, 9 figures
☆ Parametric PDE Control with Deep Reinforcement Learning and Differentiable L0-Sparse Polynomial Policies
Optimal control of parametric partial differential equations (PDEs) is crucial in many applications in engineering and science. In recent years, the progress in scientific machine learning has opened up new frontiers for the control of parametric PDEs. In particular, deep reinforcement learning (DRL) has the potential to solve high-dimensional and complex control problems in a large variety of applications. Most DRL methods rely on deep neural network (DNN) control policies. However, for many dynamical systems, DNN-based control policies tend to be over-parametrized, which means they need large amounts of training data, show limited robustness, and lack interpretability. In this work, we leverage dictionary learning and differentiable L$_0$ regularization to learn sparse, robust, and interpretable control policies for parametric PDEs. Our sparse policy architecture is agnostic to the DRL method and can be used in different policy-gradient and actor-critic DRL algorithms without changing their policy-optimization procedure. We test our approach on the challenging tasks of controlling parametric Kuramoto-Sivashinsky and convection-diffusion-reaction PDEs. We show that our method (1) outperforms baseline DNN-based DRL policies, (2) allows for the derivation of interpretable equations of the learned optimal control laws, and (3) generalizes to unseen parameters of the PDE without retraining the policies.
☆ Federated Bayesian Deep Learning: The Application of Statistical Aggregation Methods to Bayesian Models
Federated learning (FL) is an approach to training machine learning models that takes advantage of multiple distributed datasets while maintaining data privacy and reducing communication costs associated with sharing local datasets. Aggregation strategies have been developed to pool or fuse the weights and biases of distributed deterministic models; however, modern deterministic deep learning (DL) models are often poorly calibrated and lack the ability to communicate a measure of epistemic uncertainty in prediction, which is desirable for remote sensing platforms and safety-critical applications. Conversely, Bayesian DL models are often well calibrated and capable of quantifying and communicating a measure of epistemic uncertainty along with a competitive prediction accuracy. Unfortunately, because the weights and biases in Bayesian DL models are defined by a probability distribution, simple application of the aggregation methods associated with FL schemes for deterministic models is either impossible or results in sub-optimal performance. In this work, we use independent and identically distributed (IID) and non-IID partitions of the CIFAR-10 dataset and a fully variational ResNet-20 architecture to analyze six different aggregation strategies for Bayesian DL models. Additionally, we analyze the traditional federated averaging approach applied to an approximate Bayesian Monte Carlo dropout model as a lightweight alternative to more complex variational inference methods in FL. We show that aggregation strategy is a key hyperparameter in the design of a Bayesian FL system with downstream effects on accuracy, calibration, uncertainty quantification, training stability, and client compute requirements.
comment: 22 pages, 9 figures
☆ Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach
Amidst the rapid evolution of LLMs, the significance of evaluation in comprehending and propelling these models forward is increasingly paramount. Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs. However, the extent and nature of these impacts continue to be subjects of debate because most assessments have been restricted to a limited number of models and data points. Clarifying the effects of these factors on performance scores can be more effectively achieved through a statistical lens. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. With the advent of a uniform evaluation framework, our research leverages an expansive dataset of evaluation results, introducing a comprehensive statistical methodology. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique, offering a robust and transparent approach to deciphering LLM performance data. Contrary to prevailing findings, our results challenge assumptions about emergent abilities and the influence of given training types and architectures in LLMs. These findings furnish new perspectives on the characteristics, intrinsic nature, and developmental trajectories of LLMs. By providing straightforward and reliable methods to scrutinize and reassess LLM performance data, this study contributes a nuanced perspective on LLM efficiency and potentials.
☆ Spectral Motion Alignment for Video Motion Transfer using Diffusion Models
The evolution of diffusion models has greatly impacted video generation and understanding. Particularly, text-to-video diffusion models (VDMs) have significantly facilitated the customization of input video with target appearance, motion, etc. Despite these advances, challenges persist in accurately distilling motion information from video frames. While existing works leverage the consecutive frame residual as the target motion vector, they inherently lack global motion context and are vulnerable to frame-wise distortions. To address this, we present Spectral Motion Alignment (SMA), a novel framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics, and mitigating spatial artifacts. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
comment: Project page: https://geonyeong-park.github.io/spectral-motion-alignment/
☆ FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
Modern Large Language Models (LLMs) are capable of following long and complex instructions that enable a diverse amount of user tasks. However, despite Information Retrieval (IR) models using LLMs as the backbone of their architectures, nearly all of them still only take queries as input, with no instructions. For the handful of recent models that do take instructions, it's unclear how they use them. We introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR builds off the long history of the TREC conferences: as TREC provides human annotators with instructions (also known as narratives) to determine document relevance, so should IR models be able to understand and decide relevance based on these detailed instructions. Our evaluation benchmark starts with three deeply judged TREC collections and alters the annotator instructions, re-annotating relevant documents. Through this process, we can measure how well IR models follow instructions, through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them for basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model has significant improvements (over 13%) after fine-tuning on our training set.
☆ Reasoning-Enhanced Object-Centric Learning for Videos
Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of the effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computations and fusion. Our experiment results on various datasets show that STATM can significantly enhance object-centric learning capabilities of slot-based video models.
☆ A Stochastic Quasi-Newton Method for Non-convex Optimization with Non-uniform Smoothness
Classical convergence analyses for optimization algorithms rely on the widely-adopted uniform smoothness assumption. However, recent experimental studies have demonstrated that many machine learning problems exhibit non-uniform smoothness, meaning the smoothness factor is a function of the model parameter instead of a universal constant. In particular, it has been observed that the smoothness grows with respect to the gradient norm along the training trajectory. Motivated by this phenomenon, the recently introduced $(L_0, L_1)$-smoothness is a more general notion, compared to traditional $L$-smoothness, that captures such positive relationship between smoothness and gradient norm. Under this type of non-uniform smoothness, existing literature has designed stochastic first-order algorithms by utilizing gradient clipping techniques to obtain the optimal $\mathcal{O}(\epsilon^{-3})$ sample complexity for finding an $\epsilon$-approximate first-order stationary solution. Nevertheless, the studies of quasi-Newton methods are still lacking. Considering higher accuracy and more robustness for quasi-Newton methods, in this paper we propose a fast stochastic quasi-Newton method when there exists non-uniformity in smoothness. Leveraging gradient clipping and variance reduction, our algorithm can achieve the best-known $\mathcal{O}(\epsilon^{-3})$ sample complexity and enjoys convergence speedup with simple hyperparameter tuning. Our numerical experiments show that our proposed algorithm outperforms the state-of-the-art approaches.
☆ Robust Utility Optimization via a GAN Approach
Robust utility optimization enables an investor to deal with market uncertainty in a structured way, with the goal of maximizing the worst-case outcome. In this work, we propose a generative adversarial network (GAN) approach to (approximately) solve robust utility optimization problems in general and realistic settings. In particular, we model both the investor and the market by neural networks (NN) and train them in a mini-max zero-sum game. This approach is applicable for any continuous utility function and in realistic market settings with trading costs, where only observable information of the market can be used. A large empirical study shows the versatile usability of our method. Whenever an optimal reference strategy is available, our method performs on par with it and in the (many) settings without known optimal strategy, our method outperforms all other reference strategies. Moreover, we can conclude from our study that the trained path-dependent strategies do not outperform Markovian ones. Lastly, we uncover that our generative approach for learning optimal, (non-) robust investments under trading costs generates universally applicable alternatives to well known asymptotic strategies of idealized settings.
☆ Guided Decoding for Robot Motion Generation and Adaption
We address motion generation for high-DoF robot arms in complex settings with obstacles, via points, etc. A significant advancement in this domain is achieved by integrating Learning from Demonstration (LfD) into the motion generation process. This integration facilitates rapid adaptation to new tasks and optimizes the utilization of accumulated expertise by allowing robots to learn and generalize from demonstrated trajectories. We train a transformer architecture on a large dataset of simulated trajectories. This architecture, based on a conditional variational autoencoder transformer, learns essential motion generation skills and adapts these to meet auxiliary tasks and constraints. Our auto-regressive approach enables real-time integration of feedback from the physical system, enhancing the adaptability and efficiency of motion generation. We show that our model can generate motion from initial and target points, but also that it can adapt trajectories in navigating complex tasks, including obstacle avoidance, via points, and meeting velocity and acceleration constraints, across platforms.
comment: 7 pages
☆ An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets
Does the training of large language models potentially infringe upon code licenses? Furthermore, are there any datasets available that can be safely used for training these models without violating such licenses? In our study, we assess the current trends in the field and the importance of incorporating code into the training of large language models. Additionally, we examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future. To accomplish this, we compiled a list of 53 large language models trained on file-level code. We then extracted their datasets and analyzed how much they overlap with a dataset we created, consisting exclusively of strong copyleft code. Our analysis revealed that every dataset we examined contained license inconsistencies, despite being selected based on their associated repository licenses. We analyzed a total of 514 million code files, discovering 38 million exact duplicates present in our strong copyleft dataset. Additionally, we examined 171 million file-leading comments, identifying 16 million with strong copyleft licenses and another 11 million comments that discouraged copying without explicitly mentioning a license. Based on the findings of our study, which highlights the pervasive issue of license inconsistencies in large language models trained on code, our recommendation for both researchers and the community is to prioritize the development and adoption of best practices for dataset creation and management.
comment: Accepted to FORGE 2024
☆ Anytime, Anywhere, Anyone: Investigating the Feasibility of Segment Anything Model for Crowd-Sourcing Medical Image Annotations
Curating annotations for medical image segmentation is a labor-intensive and time-consuming task that requires domain expertise, resulting in "narrowly" focused deep learning (DL) models with limited translational utility. Recently, foundation models like the Segment Anything Model (SAM) have revolutionized semantic segmentation with exceptional zero-shot generalizability across various domains, including medical imaging, and hold a lot of promise for streamlining the annotation process. However, SAM has yet to be evaluated in a crowd-sourced setting to curate annotations for training 3D DL segmentation models. In this work, we explore the potential of SAM for crowd-sourcing "sparse" annotations from non-experts to generate "dense" segmentation masks for training 3D nnU-Net models, a state-of-the-art DL segmentation model. Our results indicate that while SAM-generated annotations exhibit high mean Dice scores compared to ground-truth annotations, nnU-Net models trained on SAM-generated annotations perform significantly worse than nnU-Net models trained on ground-truth annotations ($p<0.001$, all).
☆ Early Period of Training Impacts Out-of-Distribution Generalization
Prior research has found that differences in the early period of neural network training significantly impact the performance of in-distribution (ID) tasks. However, neural networks are often sensitive to out-of-distribution (OOD) data, making them less reliable in downstream applications. Yet, the impact of the early training period on OOD generalization remains understudied due to its complexity and lack of effective analytical methodologies. In this work, we investigate the relationship between learning dynamics and OOD generalization during the early period of neural network training. We utilize the trace of Fisher Information and sharpness, with a focus on gradual unfreezing (i.e. progressively unfreezing parameters during training) as the methodology for investigation. Through a series of empirical experiments, we show that 1) selecting the number of trainable parameters at different times during training, i.e. realized by gradual unfreezing -- has a minuscule impact on ID results, but greatly affects the generalization to OOD data; 2) the absolute values of sharpness and trace of Fisher Information at the initial period of training are not indicative for OOD generalization, but the relative values could be; 3) the trace of Fisher Information and sharpness may be used as indicators for the removal of interventions during early period of training for better OOD generalization.
comment: WIP
☆ Robust optimization for adversarial learning with finite sample complexity guarantees
Decision making and learning in the presence of uncertainty has attracted significant attention in view of the increasing need to achieve robust and reliable operations. In the case where uncertainty stems from the presence of adversarial attacks this need is becoming more prominent. In this paper we focus on linear and nonlinear classification problems and propose a novel adversarial training method for robust classifiers, inspired by Support Vector Machine (SVM) margins. We view robustness under a data driven lens, and derive finite sample complexity bounds for both linear and non-linear classifiers in binary and multi-class scenarios. Notably, our bounds match natural classifiers' complexity. Our algorithm minimizes a worst-case surrogate loss using Linear Programming (LP) and Second Order Cone Programming (SOCP) for linear and non-linear models. Numerical experiments on the benchmark MNIST and CIFAR10 datasets show our approach's comparable performance to state-of-the-art methods, without needing adversarial examples during training. Our work offers a comprehensive framework for enhancing binary linear and non-linear classifier robustness, embedding robustness in learning under the presence of adversaries.
☆ FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication ICDE
Serverless computing offers attractive scalability, elasticity and cost-effectiveness. However, constraints on memory, CPU and function runtime have hindered its adoption for data-intensive applications and machine learning (ML) workloads. Traditional 'server-ful' platforms enable distributed computation via fast networks and well-established inter-process communication (IPC) mechanisms such as MPI and shared memory. In the absence of such solutions in the serverless domain, parallel computation with significant IPC requirements is challenging. We present FSD-Inference, the first fully serverless and highly scalable system for distributed ML inference. We explore potential communication channels, in conjunction with Function-as-a-Service (FaaS) compute, to design a state-of-the-art solution for distributed ML within the context of serverless data-intensive computing. We introduce novel fully serverless communication schemes for ML inference workloads, leveraging both cloud-based publish-subscribe/queueing and object storage offerings. We demonstrate how publish-subscribe/queueing services can be adapted for FaaS IPC with comparable performance to object storage, while offering significantly reduced cost at high parallelism levels. We conduct in-depth experiments on benchmark DNNs of various sizes. The results show that when compared to server-based alternatives, FSD-Inference is significantly more cost-effective and scalable, and can even achieve competitive performance against optimized HPC solutions. Experiments also confirm that our serverless solution can handle large distributed workloads and leverage high degrees of FaaS parallelism.
comment: In Proceedings of 2024 IEEE 40th International Conference on Data Engineering (ICDE) (to appear)
☆ Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion
The landscape of deep learning research is moving towards innovative strategies to harness the true potential of data. Traditionally, emphasis has been on scaling model architectures, resulting in large and complex neural networks, which can be difficult to train with limited computational resources. However, independently of the model size, data quality (i.e. amount and variability) is still a major factor that affects model generalization. In this work, we propose a novel technique to exploit available data through the use of automatic data augmentation for the tasks of image classification and semantic segmentation. We introduce the first Differentiable Augmentation Search method (DAS) to generate variations of images that can be processed as videos. Compared to previous approaches, DAS is extremely fast and flexible, allowing the search on very large search spaces in less than a GPU day. Our intuition is that the increased receptive field in the temporal dimension provided by DAS could lead to benefits also to the spatial receptive field. More specifically, we leverage DAS to guide the reshaping of the spatial receptive field by selecting task-dependant transformations. As a result, compared to standard augmentation alternatives, we improve in terms of accuracy on ImageNet, Cifar10, Cifar100, Tiny-ImageNet, Pascal-VOC-2012 and CityScapes datasets when plugging-in our DAS over different light-weight video backbones.
☆ PDE-CNNs: Axiomatic Derivations and Applications
PDE-based Group Convolutional Neural Networks (PDE-G-CNNs) utilize solvers of geometrically meaningful evolution PDEs as substitutes for the conventional components in G-CNNs. PDE-G-CNNs offer several key benefits all at once: fewer parameters, inherent equivariance, better performance, data efficiency, and geometric interpretability. In this article we focus on Euclidean equivariant PDE-G-CNNs where the feature maps are two dimensional throughout. We call this variant of the framework a PDE-CNN. We list several practically desirable axioms and derive from these which PDEs should be used in a PDE-CNN. Here our approach to geometric learning via PDEs is inspired by the axioms of classical linear and morphological scale-space theory, which we generalize by introducing semifield-valued signals. Furthermore, we experimentally confirm for small networks that PDE-CNNs offer fewer parameters, better performance, and data efficiency in comparison to CNNs. We also investigate what effect the use of different semifields has on the performance of the models.
☆ Self-Improvement for Neural Combinatorial Optimization: Sample without Replacement, but Improvement
Current methods for end-to-end constructive neural combinatorial optimization usually train a policy using behavior cloning from expert solutions or policy gradient methods from reinforcement learning. While behavior cloning is straightforward, it requires expensive expert solutions, and policy gradient methods are often computationally demanding and complex to fine-tune. In this work, we bridge the two and simplify the training process by sampling multiple solutions for random instances using the current model in each epoch and then selecting the best solution as an expert trajectory for supervised imitation learning. To achieve progressively improving solutions with minimal sampling, we introduce a method that combines round-wise Stochastic Beam Search with an update strategy derived from a provable policy improvement. This strategy refines the policy between rounds by utilizing the advantage of the sampled sequences with almost no computational overhead. We evaluate our approach on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. The models trained with our method achieve comparable performance and generalization to those trained with expert data. Additionally, we apply our method to the Job Shop Scheduling Problem using a transformer-based architecture and outperform existing state-of-the-art methods by a wide margin.
☆ Exploring the Task-agnostic Trait of Self-supervised Learning in the Context of Detecting Mental Disorders
Self-supervised learning (SSL) has been investigated to generate task-agnostic representations across various domains. However, such investigation has not been conducted for detecting multiple mental disorders. The rationale behind the existence of a task-agnostic representation lies in the overlapping symptoms among multiple mental disorders. Consequently, the behavioural data collected for mental health assessment may carry a mixed bag of attributes related to multiple disorders. Motivated by that, in this study, we explore a task-agnostic representation derived through SSL in the context of detecting major depressive disorder (MDD) and post-traumatic stress disorder (PTSD) using audio and video data collected during interactive sessions. This study employs SSL models trained by predicting multiple fixed targets or masked frames. We propose a list of fixed targets to make the generated representation more efficient for detecting MDD and PTSD. Furthermore, we modify the hyper-parameters of the SSL encoder predicting fixed targets to generate global representations that capture varying temporal contexts. Both these innovations are noted to yield improved detection performances for considered mental disorders and exhibit task-agnostic traits. In the context of the SSL model predicting masked frames, the generated global representations are also noted to exhibit task-agnostic traits.
☆ Transition Graph Properties of Target Class Classification
Target class classification is a mixed classification and transition model whose integrated goal is to assign objects to a certain, so called target or normal class. The classification process is iterative, and in each step an object in a certain class undergoes an action attached to that class, initiating the transition of the object to one of the classes. The sequence of transitions, which we call class transitions, must be designed to provide the final assignment of objects to the target class. The transition process can be described in the form of a directed graph, and the success of the final classification is mainly due to the properties of this graph. In our previous research we showed that the desirable structure of the transition graph is an oriented rooted tree with orientation towards the root vertex, which corresponds to the normal class. It is clear that the transition graph of an arbitrary algorithm (policy) may not have this property. In this paper we study the structure of realistic transition graphs, which makes it possible to find classification inconsistencies, helping to transfer it into the desired form. The medical interpretation of dynamic treatment regime considered in the article further clarifies the investigated framework.
comment: 14pages, 4 figures
☆ An In-Depth Analysis of Data Reduction Methods for Sustainable Deep Learning
In recent years, Deep Learning has gained popularity for its ability to solve complex classification tasks, increasingly delivering better results thanks to the development of more accurate models, the availability of huge volumes of data and the improved computational capabilities of modern computers. However, these improvements in performance also bring efficiency problems, related to the storage of datasets and models, and to the waste of energy and time involved in both the training and inference processes. In this context, data reduction can help reduce energy consumption when training a deep learning model. In this paper, we present up to eight different methods to reduce the size of a tabular training dataset, and we develop a Python package to apply them. We also introduce a representativeness metric based on topology to measure how similar are the reduced datasets and the full training dataset. Additionally, we develop a methodology to apply these data reduction methods to image datasets for object detection tasks. Finally, we experimentally compare how these data reduction methods affect the representativeness of the reduced dataset, the energy consumption and the predictive performance of the model.
☆ On the Convergence of Adam under Non-uniform Smoothness: Separability from SGDM and Beyond
This paper aims to clearly distinguish between Stochastic Gradient Descent with Momentum (SGDM) and Adam in terms of their convergence rates. We demonstrate that Adam achieves a faster convergence compared to SGDM under the condition of non-uniformly bounded smoothness. Our findings reveal that: (1) in deterministic environments, Adam can attain the known lower bound for the convergence rate of deterministic first-order optimizers, whereas the convergence rate of Gradient Descent with Momentum (GDM) has higher order dependence on the initial function value; (2) in stochastic setting, Adam's convergence rate upper bound matches the lower bounds of stochastic first-order optimizers, considering both the initial function value and the final error, whereas there are instances where SGDM fails to converge with any learning rate. These insights distinctly differentiate Adam and SGDM regarding their convergence rates. Additionally, by introducing a novel stopping-time based technique, we further prove that if we consider the minimum gradient norm during iterations, the corresponding convergence rate can match the lower bounds across all problem hyperparameters. The technique can also help proving that Adam with a specific hyperparameter scheduler is parameter-agnostic, which hence can be of independent interest.
☆ Quantification using Permutation-Invariant Networks based on Histograms
Quantification, also known as class prevalence estimation, is the supervised learning task in which a model is trained to predict the prevalence of each class in a given bag of examples. This paper investigates the application of deep neural networks to tasks of quantification in scenarios where it is possible to apply a symmetric supervised approach that eliminates the need for classification as an intermediary step, directly addressing the quantification problem. Additionally, it discusses existing permutation-invariant layers designed for set processing and assesses their suitability for quantification. In light of our analysis, we propose HistNetQ, a novel neural architecture that relies on a permutation-invariant representation based on histograms that is specially suited for quantification problems. Our experiments carried out in the only quantification competition held to date, show that HistNetQ outperforms other deep neural architectures devised for set processing, as well as the state-of-the-art quantification methods. Furthermore, HistNetQ offers two significant advantages over traditional quantification methods: i) it does not require the labels of the training examples but only the prevalence values of a collection of training bags, making it applicable to new scenarios; and ii) it is able to optimize any custom quantification-oriented loss function.
☆ Text clustering with LLM embeddings
Text clustering is an important approach for organising the growing amount of digital content, helping to structure and find hidden patterns in uncategorised data. In this research, we investigated how different textual embeddings - particularly those used in large language models (LLMs) - and clustering algorithms affect how text datasets are clustered. A series of experiments were conducted to assess how embeddings influence clustering results, the role played by dimensionality reduction through summarisation, and embedding size adjustment. Results reveal that LLM embeddings excel at capturing the nuances of structured language, while BERT leads the lightweight options in performance. In addition, we find that increasing embedding dimensionality and summarisation techniques do not uniformly improve clustering efficiency, suggesting that these strategies require careful analysis to use in real-life models. These results highlight a complex balance between the need for nuanced text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by incorporating embeddings from LLMs, thereby paving the way for improved methodologies and opening new avenues for future research in various types of textual analysis.
☆ Active Learning for Regression based on Wasserstein distance and GroupSort Neural Networks
This paper addresses a new active learning strategy for regression problems. The presented Wasserstein active regression model is based on the principles of distribution-matching to measure the representativeness of the labeled dataset. The Wasserstein distance is computed using GroupSort Neural Networks. The use of such networks provides theoretical foundations giving a way to quantify errors with explicit bounds for their size and depth. This solution is combined with another uncertainty-based approach that is more outlier-tolerant to complete the query strategy. Finally, this method is compared with other classical and recent solutions. The study empirically shows the pertinence of such a representativity-uncertainty approach, which provides good estimation all along the query procedure. Moreover, the Wasserstein active regression often achieves more precise estimations and tends to improve accuracy faster than other models.
☆ Improving cross-domain brain tissue segmentation in fetal MRI with synthetic data
Segmentation of fetal brain tissue from magnetic resonance imaging (MRI) plays a crucial role in the study of in utero neurodevelopment. However, automated tools face substantial domain shift challenges as they must be robust to highly heterogeneous clinical data, often limited in numbers and lacking annotations. Indeed, high variability of the fetal brain morphology, MRI acquisition parameters, and superresolution reconstruction (SR) algorithms adversely affect the model's performance when evaluated out-of-domain. In this work, we introduce FetalSynthSeg, a domain randomization method to segment fetal brain MRI, inspired by SynthSeg. Our results show that models trained solely on synthetic data outperform models trained on real data in out-ofdomain settings, validated on a 120-subject cross-domain dataset. Furthermore, we extend our evaluation to 40 subjects acquired using lowfield (0.55T) MRI and reconstructed with novel SR models, showcasing robustness across different magnetic field strengths and SR algorithms. Leveraging a generative synthetic approach, we tackle the domain shift problem in fetal brain MRI and offer compelling prospects for applications in fields with limited and highly heterogeneous data.
comment: 10 pages, 5 figures, 1 table
☆ End-to-End Mineral Exploration with Artificial Intelligence and Ambient Noise Tomography
This paper presents an innovative end-to-end workflow for mineral exploration, integrating ambient noise tomography (ANT) and artificial intelligence (AI) to enhance the discovery and delineation of mineral resources essential for the global transition to a low carbon economy. We focus on copper as a critical element, required in significant quantities for renewable energy solutions. We show the benefits of utilising ANT, characterised by its speed, scalability, depth penetration, resolution, and low environmental impact, alongside artificial intelligence (AI) techniques to refine a continent-scale prospectivity model at the deposit scale by fine-tuning our model on local high-resolution data. We show the promise of the method by first presenting a new data-driven AI prospectivity model for copper within Australia, which serves as our foundation model for further fine-tuning. We then focus on the Hillside IOCG deposit on the prospective Yorke Peninsula. We show that with relatively few local training samples (orebody intercepts), we can fine tune the foundation model to provide a good estimate of the Hillside orebody outline. Our methodology demonstrates how AI can augment geophysical data interpretation, providing a novel approach to mineral exploration with improved decision-making capabilities for targeting mineralization, thereby addressing the urgent need for increased mineral resource discovery.
☆ Improved Long Short-Term Memory-based Wastewater Treatment Simulators for Deep Reinforcement Learning
Even though Deep Reinforcement Learning (DRL) showed outstanding results in the fields of Robotics and Games, it is still challenging to implement it in the optimization of industrial processes like wastewater treatment. One of the challenges is the lack of a simulation environment that will represent the actual plant as accurately as possible to train DRL policies. Stochasticity and non-linearity of wastewater treatment data lead to unstable and incorrect predictions of models over long time horizons. One possible reason for the models' incorrect simulation behavior can be related to the issue of compounding error, which is the accumulation of errors throughout the simulation. The compounding error occurs because the model utilizes its predictions as inputs at each time step. The error between the actual data and the prediction accumulates as the simulation continues. We implemented two methods to improve the trained models for wastewater treatment data, which resulted in more accurate simulators: 1- Using the model's prediction data as input in the training step as a tool of correction, and 2- Change in the loss function to consider the long-term predicted shape (dynamics). The experimental results showed that implementing these methods can improve the behavior of simulators in terms of Dynamic Time Warping throughout a year up to 98% compared to the base model. These improvements demonstrate significant promise in creating simulators for biological processes that do not need pre-existing knowledge of the process but instead depend exclusively on time series data obtained from the system.
☆ SIMAP: A simplicial-map layer for neural networks
In this paper, we present SIMAP, a novel layer integrated into deep learning models, aimed at enhancing the interpretability of the output. The SIMAP layer is an enhanced version of Simplicial-Map Neural Networks (SMNNs), an explainable neural network based on support sets and simplicial maps (functions used in topology to transform shapes while preserving their structural connectivity). The novelty of the methodology proposed in this paper is two-fold: Firstly, SIMAP layers work in combination with other deep learning architectures as an interpretable layer substituting classic dense final layers. Secondly, unlike SMNNs, the support set is based on a fixed maximal simplex, the barycentric subdivision being efficiently computed with a matrix-based multiplication algorithm.
☆ Automated Feature Selection for Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) is an imitation learning approach to learning reward functions from expert demonstrations. Its use avoids the difficult and tedious procedure of manual reward specification while retaining the generalization power of reinforcement learning. In IRL, the reward is usually represented as a linear combination of features. In continuous state spaces, the state variables alone are not sufficiently rich to be used as features, but which features are good is not known in general. To address this issue, we propose a method that employs polynomial basis functions to form a candidate set of features, which are shown to allow the matching of statistical moments of state distributions. Feature selection is then performed for the candidates by leveraging the correlation between trajectory probabilities and feature expectations. We demonstrate the approach's effectiveness by recovering reward functions that capture expert policies across non-linear control tasks of increasing complexity. Code, data, and videos are available at https://sites.google.com/view/feature4irl.
comment: 7 pages, 4 figures
☆ GTAGCN: Generalized Topology Adaptive Graph Convolutional Networks
Graph Neural Networks (GNN) have emerged as a popular and standard approach for learning from graph-structured data. The literature on GNN highlights the potential of this evolving research area and its widespread adoption in real-life applications. However, most of the approaches are either new in concept or derived from specific techniques. Therefore, the potential of more than one approach in hybrid form has not been studied extensively, which can be well utilized for sequenced data or static data together. We derive a hybrid approach based on two established techniques as generalized aggregation networks and topology adaptive graph convolution networks that solve our purpose to apply on both types of sequenced and static nature of data, effectively. The proposed method applies to both node and graph classification. Our empirical analysis reveals that the results are at par with literature results and better for handwritten strokes as sequenced data, where graph structures have not been explored.
comment: 2 figures, 3 tables and 26 pages
☆ On the Inclusion of Charge and Spin States in Cartesian Tensor Neural Network Potentials
In this letter, we present an extension to TensorNet, a state-of-the-art equivariant Cartesian tensor neural network potential, allowing it to handle charged molecules and spin states without architectural changes or increased costs. By incorporating these attributes, we address input degeneracy issues, enhancing the model's predictive accuracy across diverse chemical systems. This advancement significantly broadens TensorNet's applicability, maintaining its efficiency and accuracy.
☆ Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning
Large-scale Text-to-Image (TTI) models have become a common approach for generating training data in various generative fields. However, visual hallucinations, which contain perceptually critical defects, remain a concern, especially in non-photorealistic styles like cartoon characters. We propose a novel visual hallucination detection system for cartoon character images generated by TTI models. Our approach leverages pose-aware in-context visual learning (PA-ICVL) with Vision-Language Models (VLMs), utilizing both RGB images and pose information. By incorporating pose guidance from a fine-tuned pose estimator, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. This research advances TTI models by mitigating visual hallucinations, expanding their potential in non-photorealistic domains.
comment: 11 pages, 12 figures, 1 table, Project page: https://gh-bumsookim.github.io/Cartoon-Hallucinations-Detection/
☆ DP-Dueling: Learning from Preference Feedback without Compromising User Privacy
We consider the well-studied dueling bandit problem, where a learner aims to identify near-optimal actions using pairwise comparisons, under the constraint of differential privacy. We consider a general class of utility-based preference matrices for large (potentially unbounded) decision spaces and give the first differentially private dueling bandit algorithm for active learning with user preferences. Our proposed algorithms are computationally efficient with near-optimal performance, both in terms of the private and non-private regret bound. More precisely, we show that when the decision space is of finite size $K$, our proposed algorithm yields order optimal $O\Big(\sum_{i = 2}^K\log\frac{KT}{\Delta_i} + \frac{K}{\epsilon}\Big)$ regret bound for pure $\epsilon$-DP, where $\Delta_i$ denotes the suboptimality gap of the $i$-th arm. We also present a matching lower bound analysis which proves the optimality of our algorithms. Finally, we extend our results to any general decision space in $d$-dimensions with potentially infinite arms and design an $\epsilon$-DP algorithm with regret $\tilde{O} \left( \frac{d^6}{\kappa \epsilon } + \frac{ d\sqrt{T }}{\kappa} \right)$, providing privacy for free when $T \gg d$.
☆ Estimation of multiple mean vectors in high dimension
We endeavour to estimate numerous multi-dimensional means of various probability distributions on a common space based on independent samples. Our approach involves forming estimators through convex combinations of empirical means derived from these samples. We introduce two strategies to find appropriate data-dependent convex combination weights: a first one employing a testing procedure to identify neighbouring means with low variance, which results in a closed-form plug-in formula for the weights, and a second one determining weights via minimization of an upper confidence bound on the quadratic risk.Through theoretical analysis, we evaluate the improvement in quadratic risk offered by our methods compared to the empirical means. Our analysis focuses on a dimensional asymptotics perspective, showing that our methods asymptotically approach an oracle (minimax) improvement as the effective dimension of the data increases.We demonstrate the efficacy of our methods in estimating multiple kernel mean embeddings through experiments on both simulated and real-world datasets.
☆ Image Classification with Rotation-Invariant Variational Quantum Circuits
Variational quantum algorithms are gaining attention as an early application of Noisy Intermediate-Scale Quantum (NISQ) devices. One of the main problems of variational methods lies in the phenomenon of Barren Plateaus, present in the optimization of variational parameters. Adding geometric inductive bias to the quantum models has been proposed as a potential solution to mitigate this problem, leading to a new field called Geometric Quantum Machine Learning. In this work, an equivariant architecture for variational quantum classifiers is introduced to create a label-invariant model for image classification with $C_4$ rotational label symmetry. The equivariant circuit is benchmarked against two different architectures, and it is experimentally observed that the geometric approach boosts the model's performance. Finally, a classical equivariant convolution operation is proposed to extend the quantum model for the processing of larger images, employing the resources available in NISQ devices.
comment: 9 pages, 9 figures
☆ Grey-informed neural network for time-series forecasting
Neural network models have shown outstanding performance and successful resolutions to complex problems in various fields. However, the majority of these models are viewed as black-box, requiring a significant amount of data for development. Consequently, in situations with limited data, constructing appropriate models becomes challenging due to the lack of transparency and scarcity of data. To tackle these challenges, this study suggests the implementation of a grey-informed neural network (GINN). The GINN ensures that the output of the neural network follows the differential equation model of the grey system, improving interpretability. Moreover, incorporating prior knowledge from grey system theory enables traditional neural networks to effectively handle small data samples. Our proposed model has been observed to uncover underlying patterns in the real world and produce reliable forecasts based on empirical data.
☆ Robust Conformal Prediction under Distribution Shift via Physics-Informed Structural Causal Model
Uncertainty is critical to reliable decision-making with machine learning. Conformal prediction (CP) handles uncertainty by predicting a set on a test input, hoping the set to cover the true label with at least $(1-\alpha)$ confidence. This coverage can be guaranteed on test data even if the marginal distributions $P_X$ differ between calibration and test datasets. However, as it is common in practice, when the conditional distribution $P_{Y|X}$ is different on calibration and test data, the coverage is not guaranteed and it is essential to measure and minimize the coverage loss under distributional shift at \textit{all} possible confidence levels. To address these issues, we upper bound the coverage difference at all levels using the cumulative density functions of calibration and test conformal scores and Wasserstein distance. Inspired by the invariance of physics across data distributions, we propose a physics-informed structural causal model (PI-SCM) to reduce the upper bound. We validated that PI-SCM can improve coverage robustness along confidence level and test domain on a traffic speed prediction task and an epidemic spread task with multiple real-world datasets.
☆ Insights into the Lottery Ticket Hypothesis and the Iterative Magnitude Pruning
Lottery ticket hypothesis for deep neural networks emphasizes the importance of initialization used to re-train the sparser networks obtained using the iterative magnitude pruning process. An explanation for why the specific initialization proposed by the lottery ticket hypothesis tends to work better in terms of generalization (and training) performance has been lacking. Moreover, the underlying principles in iterative magnitude pruning, like the pruning of smaller magnitude weights and the role of the iterative process, lack full understanding and explanation. In this work, we attempt to provide insights into these phenomena by empirically studying the volume/geometry and loss landscape characteristics of the solutions obtained at various stages of the iterative magnitude pruning process.
☆ Vehicle Detection Performance in Nordic Region ICPR2024
This paper addresses the critical challenge of vehicle detection in the harsh winter conditions in the Nordic regions, characterized by heavy snowfall, reduced visibility, and low lighting. Due to their susceptibility to environmental distortions and occlusions, traditional vehicle detection methods have struggled in these adverse conditions. The advanced proposed deep learning architectures brought promise, yet the unique difficulties of detecting vehicles in Nordic winters remain inadequately addressed. This study uses the Nordic Vehicle Dataset (NVD), which has UAV images from northern Sweden, to evaluate the performance of state-of-the-art vehicle detection algorithms under challenging weather conditions. Our methodology includes a comprehensive evaluation of single-stage, two-stage, and transformer-based detectors against the NVD. We propose a series of enhancements tailored to each detection framework, including data augmentation, hyperparameter tuning, transfer learning, and novel strategies designed explicitly for the DETR model. Our findings not only highlight the limitations of current detection systems in the Nordic environment but also offer promising directions for enhancing these algorithms for improved robustness and accuracy in vehicle detection amidst the complexities of winter landscapes. The code and the dataset are available at https://nvd.ltu-ai.dev
comment: submitted to ICPR2024
☆ Empirical investigation of multi-source cross-validation in clinical machine learning
Traditionally, machine learning-based clinical prediction models have been trained and evaluated on patient data from a single source, such as a hospital. Cross-validation methods can be used to estimate the accuracy of such models on new patients originating from the same source, by repeated random splitting of the data. However, such estimates tend to be highly overoptimistic when compared to accuracy obtained from deploying models to sources not represented in the dataset, such as a new hospital. The increasing availability of multi-source medical datasets provides new opportunities for obtaining more comprehensive and realistic evaluations of expected accuracy through source-level cross-validation designs. In this study, we present a systematic empirical evaluation of standard K-fold cross-validation and leave-source-out cross-validation methods in a multi-source setting. We consider the task of electrocardiogram based cardiovascular disease classification, combining and harmonizing the openly available PhysioNet CinC Challenge 2021 and the Shandong Provincial Hospital datasets for our study. Our results show that K-fold cross-validation, both on single-source and multi-source data, systemically overestimates prediction performance when the end goal is to generalize to new sources. Leave-source-out cross-validation provides more reliable performance estimates, having close to zero bias though larger variability. The evaluation highlights the dangers of obtaining misleading cross-validation results on medical data and demonstrates how these issues can be mitigated when having access to multi-source data.
comment: 14 pages, 3 figures
☆ ParFormer: Vision Transformer Baseline with Parallel Local Global Token Mixer and Convolution Attention Patch Embedding
This work presents ParFormer as an enhanced transformer architecture that allows the incorporation of different token mixers into a single stage, hence improving feature extraction capabilities. Integrating both local and global data allows for precise representation of short- and long-range spatial relationships without the need for computationally intensive methods such as shifting windows. Along with the parallel token mixer encoder, We offer the Convolutional Attention Patch Embedding (CAPE) as an enhancement of standard patch embedding to improve token mixer extraction with a convolutional attention module. Our comprehensive evaluation demonstrates that our ParFormer outperforms CNN-based and state-of-the-art transformer-based architectures in image classification and several complex tasks such as object recognition. The proposed CAPE has been demonstrated to benefit the overall MetaFormer architecture, even while utilizing the Identity Mapping Token Mixer, resulting in a 0.5\% increase in accuracy. The ParFormer models outperformed ConvNeXt and Swin Transformer for the pure convolution and transformer model in accuracy. Furthermore, our model surpasses the current leading hybrid transformer by reaching competitive Top-1 scores in the ImageNet-1K classification test. Specifically, our model variants with 11M, 23M, and 34M parameters achieve scores of 80.4\%, 82.1\%, and 83.1\%, respectively. Code: https://github.com/novendrastywn/ParFormer-CAPE-2024
☆ Magic for the Age of Quantized DNNs
Recently, the number of parameters in DNNs has explosively increased, as exemplified by LLMs (Large Language Models), making inference on small-scale computers more difficult. Model compression technology is, therefore, essential for integration into products. In this paper, we propose a method of quantization-aware training. We introduce a novel normalization (Layer-Batch Normalization) that is independent of the mini-batch size and does not require any additional computation cost during inference. Then, we quantize the weights by the scaled round-clip function with the weight standardization. We also quantize activation functions using the same function and apply surrogate gradients to train the model with both quantized weights and the quantized activation functions. We call this method Magic for the age of Quantised DNNs (MaQD). Experimental results show that our quantization method can be achieved with minimal accuracy degradation.
comment: 14 pages, 5 figures, 4 tables
☆ Piecewise-Linear Manifolds for Deep Metric Learning
Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point. These neighborhoods are used to estimate similarity between data points. We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques. We also show that proxies, commonly used in supervised metric learning, can be used to model the piecewise-linear manifold in an unsupervised setting, helping improve performance. Our method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks.
comment: Accepted at CPAL 2024 (Oral)
☆ Trajectory Regularization Enhances Self-Supervised Geometric Representation
Self-supervised learning (SSL) has proven effective in learning high-quality representations for various downstream tasks, with a primary focus on semantic tasks. However, its application in geometric tasks remains underexplored, partially due to the absence of a standardized evaluation method for geometric representations. To address this gap, we introduce a new pose-estimation benchmark for assessing SSL geometric representations, which demands training without semantic or pose labels and achieving proficiency in both semantic and geometric downstream tasks. On this benchmark, we study enhancing SSL geometric representations without sacrificing semantic classification accuracy. We find that leveraging mid-layer representations improves pose-estimation performance by 10-20%. Further, we introduce an unsupervised trajectory-regularization loss, which improves performance by an additional 4% and improves generalization ability on out-of-distribution data. We hope the proposed benchmark and methods offer new insights and improvements in self-supervised geometric representation learning.
☆ Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices
As deep learning models exponentially increase in size, optimizers such as Adam encounter significant memory consumption challenges due to the storage of first and second moment data. Current memory-efficient methods like Adafactor and CAME often compromise accuracy with their matrix factorization techniques. Addressing this, we introduce Adapprox, a novel approach that employs randomized low-rank matrix approximation for a more effective and accurate approximation of Adam's second moment. Adapprox features an adaptive rank selection mechanism, finely balancing accuracy and memory efficiency, and includes an optional cosine similarity guidance strategy to enhance stability and expedite convergence. In GPT-2 training and downstream tasks, Adapprox surpasses AdamW by achieving 34.5% to 49.9% and 33.8% to 49.9% memory savings for the 117M and 345M models, respectively, with the first moment enabled, and further increases these savings without the first moment. Besides, it enhances convergence speed and improves downstream task performance relative to its counterparts.
☆ Simple Graph Condensation
The burdensome training costs on large-scale graphs have aroused significant interest in graph condensation, which involves tuning Graph Neural Networks (GNNs) on a small condensed graph for use on the large-scale original graph. Existing methods primarily focus on aligning key metrics between the condensed and original graphs, such as gradients, distribution and trajectory of GNNs, yielding satisfactory performance on downstream tasks. However, these complex metrics necessitate intricate computations and can potentially disrupt the optimization process of the condensation graph, making the condensation process highly demanding and unstable. Motivated by the recent success of simplified models in various fields, we propose a simplified approach to metric alignment in graph condensation, aiming to reduce unnecessary complexity inherited from GNNs. In our approach, we eliminate external parameters and exclusively retain the target condensed graph during the condensation process. Following the hierarchical aggregation principles of GNNs, we introduce the Simple Graph Condensation (SimGC) framework, which aligns the condensed graph with the original graph from the input layer to the prediction layer, guided by a pre-trained Simple Graph Convolution (SGC) model on the original graph. As a result, both graphs possess the similar capability to train GNNs. This straightforward yet effective strategy achieves a significant speedup of up to 10 times compared to existing graph condensation methods while performing on par with state-of-the-art baselines. Comprehensive experiments conducted on seven benchmark datasets demonstrate the effectiveness of SimGC in prediction accuracy, condensation time, and generalization capability. Our code will be made publicly available.
comment: Under review
☆ KnowLA: Enhancing Parameter-efficient Finetuning with Knowledgeable Adaptation NAACL 2024
Parameter-efficient finetuning (PEFT) is a key technique for adapting large language models (LLMs) to downstream tasks. In this paper, we study leveraging knowledge graph embeddings to improve the effectiveness of PEFT. We propose a knowledgeable adaptation method called KnowLA. It inserts an adaptation layer into an LLM to integrate the embeddings of entities appearing in the input text. The adaptation layer is trained in combination with LoRA on instruction data. Experiments on six benchmarks with two popular LLMs and three knowledge graphs demonstrate the effectiveness and robustness of KnowLA. We show that \modelname can help activate the relevant parameterized knowledge in an LLM to answer a question without changing its parameters or input prompts.
comment: Accepted in the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024)
☆ Addressing Concept Shift in Online Time Series Forecasting: Detect-then-Adapt
Online updating of time series forecasting models aims to tackle the challenge of concept drifting by adjusting forecasting models based on streaming data. While numerous algorithms have been developed, most of them focus on model design and updating. In practice, many of these methods struggle with continuous performance regression in the face of accumulated concept drifts over time. To address this limitation, we present a novel approach, Concept \textbf{D}rift \textbf{D}etection an\textbf{D} \textbf{A}daptation (D3A), that first detects drifting conception and then aggressively adapts the current model to the drifted concepts after the detection for rapid adaption. To best harness the utility of historical data for model adaptation, we propose a data augmentation strategy introducing Gaussian noise into existing training instances. It helps mitigate the data distribution gap, a critical factor contributing to train-test performance inconsistency. The significance of our data augmentation process is verified by our theoretical analysis. Our empirical studies across six datasets demonstrate the effectiveness of D3A in improving model adaptation capability. Notably, compared to a simple Temporal Convolutional Network (TCN) baseline, D3A reduces the average Mean Squared Error (MSE) by $43.9\%$. For the state-of-the-art (SOTA) model, the MSE is reduced by $33.3\%$.
comment: 7 figures, 14 pages. arXiv admin note: text overlap with arXiv:2309.12659
☆ A Single Linear Layer Yields Task-Adapted Low-Rank Matrices LREC
Low-Rank Adaptation (LoRA) is a widely used Parameter-Efficient Fine-Tuning (PEFT) method that updates an initial weight matrix $W_0$ with a delta matrix $\Delta W$ consisted by two low-rank matrices $A$ and $B$. A previous study suggested that there is correlation between $W_0$ and $\Delta W$. In this study, we aim to delve deeper into relationships between $W_0$ and low-rank matrices $A$ and $B$ to further comprehend the behavior of LoRA. In particular, we analyze a conversion matrix that transform $W_0$ into low-rank matrices, which encapsulates information about the relationships. Our analysis reveals that the conversion matrices are similar across each layer. Inspired by these findings, we hypothesize that a single linear layer, which takes each layer's $W_0$ as input, can yield task-adapted low-rank matrices. To confirm this hypothesis, we devise a method named Conditionally Parameterized LoRA (CondLoRA) that updates initial weight matrices with low-rank matrices derived from a single linear layer. Our empirical results show that CondLoRA maintains a performance on par with LoRA, despite the fact that the trainable parameters of CondLoRA are fewer than those of LoRA. Therefore, we conclude that "a single linear layer yields task-adapted low-rank matrices."
comment: Accepted at LREC-COLING 2024
☆ Unifying Lane-Level Traffic Prediction from a Graph Structural Perspective: Benchmark and Baseline
Traffic prediction has long been a focal and pivotal area in research, witnessing both significant strides from city-level to road-level predictions in recent years. With the advancement of Vehicle-to-Everything (V2X) technologies, autonomous driving, and large-scale models in the traffic domain, lane-level traffic prediction has emerged as an indispensable direction. However, further progress in this field is hindered by the absence of comprehensive and unified evaluation standards, coupled with limited public availability of data and code. This paper extensively analyzes and categorizes existing research in lane-level traffic prediction, establishes a unified spatial topology structure and prediction tasks, and introduces a simple baseline model, GraphMLP, based on graph structure and MLP networks. We have replicated codes not publicly available in existing studies and, based on this, thoroughly and fairly assessed various models in terms of effectiveness, efficiency, and applicability, providing insights for practical applications. Additionally, we have released three new datasets and corresponding codes to accelerate progress in this field, all of which can be found on https://github.com/ShuhaoLii/TITS24LaneLevel-Traffic-Benchmark.
☆ Contrastive Learning on Multimodal Analysis of Electronic Health Records
Electronic health record (EHR) systems contain a wealth of multimodal clinical data including structured data like clinical codes and unstructured data such as clinical notes. However, many existing EHR-focused studies has traditionally either concentrated on an individual modality or merged different modalities in a rather rudimentary fashion. This approach often results in the perception of structured and unstructured data as separate entities, neglecting the inherent synergy between them. Specifically, the two important modalities contain clinically relevant, inextricably linked and complementary health information. A more complete picture of a patient's medical history is captured by the joint analysis of the two modalities of data. Despite the great success of multimodal contrastive learning on vision-language, its potential remains under-explored in the realm of multimodal EHR, particularly in terms of its theoretical understanding. To accommodate the statistical analysis of multimodal EHR data, in this paper, we propose a novel multimodal feature embedding generative model and design a multimodal contrastive loss to obtain the multimodal EHR feature representation. Our theoretical analysis demonstrates the effectiveness of multimodal learning compared to single-modality learning and connects the solution of the loss function to the singular value decomposition of a pointwise mutual information matrix. This connection paves the way for a privacy-preserving algorithm tailored for multimodal EHR feature representation learning. Simulation studies show that the proposed algorithm performs well under a variety of configurations. We further validate the clinical utility of the proposed algorithm in real-world EHR data.
comment: 34 pages
☆ CODA: A COst-efficient Test-time Domain Adaptation Mechanism for HAR
In recent years, emerging research on mobile sensing has led to novel scenarios that enhance daily life for humans, but dynamic usage conditions often result in performance degradation when systems are deployed in real-world settings. Existing solutions typically employ one-off adaptation schemes based on neural networks, which struggle to ensure robustness against uncertain drifting conditions in human-centric sensing scenarios. In this paper, we propose CODA, a COst-efficient Domain Adaptation mechanism for mobile sensing that addresses real-time drifts from the data distribution perspective with active learning theory, ensuring cost-efficient adaptation directly on the device. By incorporating a clustering loss and importance-weighted active learning algorithm, CODA retains the relationship between different clusters during cost-effective instance-level updates, preserving meaningful structure within the data distribution. We also showcase its generalization by seamlessly integrating it with Neural Network-based solutions for Human Activity Recognition tasks. Through meticulous evaluations across diverse datasets, including phone-based, watch-based, and integrated sensor-based sensing tasks, we demonstrate the feasibility and potential of online adaptation with CODA. The promising results achieved by CODA, even without learnable parameters, also suggest the possibility of realizing unobtrusive adaptation through specific application designs with sufficient feedback.
☆ Deep learning-based method for weather forecasting: A case study in Itoshima
Accurate weather forecasting is of paramount importance for a wide range of practical applications, drawing substantial scientific and societal interest. However, the intricacies of weather systems pose substantial challenges to accurate predictions. This research introduces a multilayer perceptron model tailored for weather forecasting in Itoshima, Kyushu, Japan. Our meticulously designed architecture demonstrates superior performance compared to existing models, surpassing benchmarks such as Long Short-Term Memory and Recurrent Neural Networks.
☆ Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective
In this paper, we study the feature learning ability of two-layer neural networks in the mean-field regime through the lens of kernel methods. To focus on the dynamics of the kernel induced by the first layer, we utilize a two-timescale limit, where the second layer moves much faster than the first layer. In this limit, the learning problem is reduced to the minimization problem over the intrinsic kernel. Then, we show the global convergence of the mean-field Langevin dynamics and derive time and particle discretization error. We also demonstrate that two-layer neural networks can learn a union of multiple reproducing kernel Hilbert spaces more efficiently than any kernel methods, and neural networks acquire data-dependent kernel which aligns with the target function. In addition, we develop a label noise procedure, which converges to the global optimum and show that the degrees of freedom appears as an implicit regularization.
☆ Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation
In this article, we address the problem of federated learning in the presence of stragglers. For this problem, a coded federated learning framework has been proposed, where the central server aggregates gradients received from the non-stragglers and gradient computed from a privacy-preservation global coded dataset to mitigate the negative impact of the stragglers. However, when aggregating these gradients, fixed weights are consistently applied across iterations, neglecting the generation process of the global coded dataset and the dynamic nature of the trained model over iterations. This oversight may result in diminished learning performance. To overcome this drawback, we propose a new method named adaptive coded federated learning (ACFL). In ACFL, before the training, each device uploads a coded local dataset with additive noise to the central server to generate a global coded dataset under privacy preservation requirements. During each iteration of the training, the central server aggregates the gradients received from the non-stragglers and the gradient computed from the global coded dataset, where an adaptive policy for varying the aggregation weights is designed. Under this policy, we optimize the performance in terms of privacy and learning, where the learning performance is analyzed through convergence analysis and the privacy performance is characterized via mutual information differential privacy. Finally, we perform simulations to demonstrate the superiority of ACFL compared with the non-adaptive methods.
☆ Hydro: Adaptive Query Processing of ML Queries
Query optimization in relational database management systems (DBMSs) is critical for fast query processing. The query optimizer relies on precise selectivity and cost estimates to effectively optimize queries prior to execution. While this strategy is effective for relational DBMSs, it is not sufficient for DBMSs tailored for processing machine learning (ML) queries. In ML-centric DBMSs, query optimization is challenging for two reasons. First, the performance bottleneck of the queries shifts to user-defined functions (UDFs) that often wrap around deep learning models, making it difficult to accurately estimate UDF statistics without profiling the query. This leads to inaccurate statistics and sub-optimal query plans. Second, the optimal query plan for ML queries is data-dependent, necessitating DBMSs to adapt the query plan on the fly during execution. So, a static query plan is not sufficient for such queries. In this paper, we present Hydro, an ML-centric DBMS that utilizes adaptive query processing (AQP) for efficiently processing ML queries. Hydro is designed to quickly evaluate UDF-based query predicates by ensuring optimal predicate evaluation order and improving the scalability of UDF execution. By integrating AQP, Hydro continuously monitors UDF statistics, routes data to predicates in an optimal order, and dynamically allocates resources for evaluating predicates. We demonstrate Hydro's efficacy through four illustrative use cases, delivering up to 11.52x speedup over a baseline system.
☆ Web-based Melanoma Detection
Melanoma is the most aggressive form of skin cancer, and early detection can significantly increase survival rates and prevent cancer spread. However, developing reliable automated detection techniques is difficult due to the lack of standardized datasets and evaluation methods. This study introduces a unified melanoma classification approach that supports 54 combinations of 11 datasets and 24 state-of-the-art deep learning architectures. It enables a fair comparison of 1,296 experiments and results in a lightweight model deployable to the web-based MeshNet architecture named Mela-D. This approach can run up to 33x faster by reducing parameters 24x to yield an analogous 88.8\% accuracy comparable with ResNet50 on previously unseen images. This allows efficient and accurate melanoma detection in real-world settings that can run on consumer-level hardware.
comment: 10 pages, 9 figures
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by final loss and language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
♻ ☆ Finding the right XAI method -- A Guide for the Evaluation and Ranking of Explainable AI Methods in Climate Science
Explainable artificial intelligence (XAI) methods shed light on the predictions of machine learning algorithms. Several different approaches exist and have already been applied in climate science. However, usually missing ground truth explanations complicate their evaluation and comparison, subsequently impeding the choice of the XAI method. Therefore, in this work, we introduce XAI evaluation in the climate context and discuss different desired explanation properties, namely robustness, faithfulness, randomization, complexity, and localization. To this end, we chose previous work as a case study where the decade of annual-mean temperature maps is predicted. After training both a multi-layer perceptron (MLP) and a convolutional neural network (CNN), multiple XAI methods are applied and their skill scores in reference to a random uniform explanation are calculated for each property. Independent of the network, we find that XAI methods Integrated Gradients, layer-wise relevance propagation, and input times gradients exhibit considerable robustness, faithfulness, and complexity while sacrificing randomization performance. Sensitivity methods -- gradient, SmoothGrad, NoiseGrad, and FusionGrad, match the robustness skill but sacrifice faithfulness and complexity for randomization skill. We find architecture-dependent performance differences regarding robustness, complexity and localization skills of different XAI methods, highlighting the necessity for research task-specific evaluation. Overall, our work offers an overview of different evaluation properties in the climate science context and shows how to compare and benchmark different explanation methods, assessing their suitability based on strengths and weaknesses, for the specific research problem at hand. By that, we aim to support climate researchers in the selection of a suitable XAI method.
comment: 19 pages, 10 figure, accepted at AIES journal by AMS
♻ ☆ Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc. with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics.
comment: Project page at https://videoshop-editing.github.io/
♻ ☆ Empowering Autonomous Driving with Large Language Models: A Safety Perspective ICLR2024
Autonomous Driving (AD) encounters significant safety hurdles in long-tail unforeseen driving scenarios, largely stemming from the non-interpretability and poor generalization of the deep neural networks within the AD system, particularly in out-of-distribution and uncertain data. To this end, this paper explores the integration of Large Language Models (LLMs) into AD systems, leveraging their robust common-sense knowledge and reasoning abilities. The proposed methodologies employ LLMs as intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning, for enhancing driving performance and safety. We present two key studies in a simulated environment: an adaptive LLM-conditioned Model Predictive Control (MPC) and an LLM-enabled interactive behavior planning scheme with a state machine. Demonstrating superior performance and safety metrics compared to state-of-the-art approaches, our approach shows the promising potential for using LLMs for autonomous vehicles.
comment: Accepted to LLMAgent workshop @ICLR2024
♻ ☆ From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford's Geometric Algebra and Convexity
In this paper, we introduce a novel analysis of neural networks based on geometric (Clifford) algebra and convex optimization. We show that optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with standard regularized loss. Furthermore, the training problem reduces to convex optimization over wedge product features, which encode the geometric structure of the training dataset. This structure is given in terms of signed volumes of triangles and parallelotopes generated by data vectors. The convex problem finds a small subset of samples via $\ell_1$ regularization to discover only relevant wedge product features. Our analysis provides a novel perspective on the inner workings of deep neural networks and sheds light on the role of the hidden layers.
♻ ☆ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
♻ ☆ Attractor reconstruction with reservoir computers: The effect of the reservoir's conditional Lyapunov exponents on faithful attractor reconstruction
Reservoir computing is a machine learning framework that has been shown to be able to replicate the chaotic attractor, including the fractal dimension and the entire Lyapunov spectrum, of the dynamical system on which it is trained. We quantitatively relate the generalized synchronization dynamics of a driven reservoir during the training stage to the performance of the trained reservoir computer at the attractor reconstruction task. We show that, in order to obtain successful attractor reconstruction and Lyapunov spectrum estimation, the largest conditional Lyapunov exponent of the driven reservoir must be significantly more negative than the most negative Lyapunov exponent of the target system. We also find that the maximal conditional Lyapunov exponent of the reservoir depends strongly on the spectral radius of the reservoir adjacency matrix, and therefore, for attractor reconstruction and Lyapunov spectrum estimation, small spectral radius reservoir computers perform better in general. Our arguments are supported by numerical examples on well-known chaotic systems.
♻ ☆ Learning to Predict Structural Vibrations
In mechanical structures like airplanes, cars and houses, noise is generated and transmitted through vibrations. To take measures to reduce this noise, vibrations need to be simulated with expensive numerical computations. Surrogate deep learning models present a promising alternative to classical numerical simulations as they can be evaluated magnitudes faster, while trading-off accuracy. To quantify such trade-offs systematically and foster the development of methods, we present a benchmark on the task of predicting the vibration of harmonically excited plates. The benchmark features a total of 12000 plate geometries with varying forms of beadings, material and sizes with associated numerical solutions. To address the benchmark task, we propose a new network architecture, named Frequency-Query Operator, which is trained to map plate geometries to their vibration pattern given a specific excitation frequency. Applying principles from operator learning and implicit models for shape encoding, our approach effectively addresses the prediction of highly variable frequency response functions occurring in dynamic systems. To quantify the prediction quality, we introduce a set of evaluation metrics and evaluate the method on our vibrating-plates benchmark. Our method outperforms DeepONets, Fourier Neural Operators and more traditional neural network architectures. Code, dataset and visualizations: https://eckerlab.org/code/delden2023_plate
♻ ☆ Learning High-level Semantic-Relational Concepts for SLAM
Recent works on SLAM extend their pose graphs with higher-level semantic concepts like Rooms exploiting relationships between them, to provide, not only a richer representation of the situation/environment but also to improve the accuracy of its estimation. Concretely, our previous work, Situational Graphs (S-Graphs+), a pioneer in jointly leveraging semantic relationships in the factor optimization process, relies on semantic entities such as Planes and Rooms, whose relationship is mathematically defined. Nevertheless, there is no unique approach to finding all the hidden patterns in lower-level factor-graphs that correspond to high-level concepts of different natures. It is currently tackled with ad-hoc algorithms, which limits its graph expressiveness. To overcome this limitation, in this work, we propose an algorithm based on Graph Neural Networks for learning high-level semantic-relational concepts that can be inferred from the low-level factor graph. Given a set of mapped Planes our algorithm is capable of inferring Room entities relating to the Planes. Additionally, to demonstrate the versatility of our method, our algorithm can infer an additional semantic-relational concept, i.e. Wall, and its relationship with its Planes. We validate our method in both simulated and real datasets demonstrating improved performance over two baseline approaches. Furthermore, we integrate our method into the S-Graphs+ algorithm providing improved pose and map accuracy compared to the baseline while further enhancing the scene representation.
♻ ☆ Novelty Detection in Reinforcement Learning with World Models
Reinforcement learning (RL) using world models has found significant recent successes. However, when a sudden change to world mechanics or properties occurs then agent performance and reliability can dramatically decline. We refer to the sudden change in visual properties or state transitions as novelties. Implementing novelty detection within generated world model frameworks is a crucial task for protecting the agent when deployed. In this paper, we propose straightforward bounding approaches to incorporate novelty detection into world model RL agents, by utilizing the misalignment of the world model's hallucinated states and the true observed states as an anomaly score. We provide effective approaches to detecting novelties in a distribution of transitions learned by an agent in a world model. Finally, we show the advantage of our work in a novel environment compared to traditional machine learning novelty detection methods as well as currently accepted RL focused novelty detection algorithms.
♻ ☆ Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision latency compared to existing naive kernels for 1-D and 2-D neighborhood attention respectively. We find certain inherent inefficiencies in all unfused neighborhood attention kernels that bound their performance and lower-precision scalability. We also developed fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision latency. We observe that our fused kernels successfully circumvent some of the unavoidable inefficiencies in unfused implementations. While our unfused GEMM-based kernels only improve half precision performance compared to naive kernels by an average of 496% and 113% in 1-D and 2-D problems respectively, our fused kernels improve naive kernels by an average of 1607% and 581% in 1-D and 2-D problems respectively.
comment: Project page: https://github.com/SHI-Labs/NATTEN
♻ ☆ Unveiling Group-Specific Distributed Concept Drift: A Fairness Imperative in Federated Learning
In the evolving field of machine learning, ensuring fairness has become a critical concern, prompting the development of algorithms designed to mitigate discriminatory outcomes in decision-making processes. However, achieving fairness in the presence of group-specific concept drift remains an unexplored frontier, and our research represents pioneering efforts in this regard. Group-specific concept drift refers to situations where one group experiences concept drift over time while another does not, leading to a decrease in fairness even if accuracy remains fairly stable. Within the framework of federated learning, where clients collaboratively train models, its distributed nature further amplifies these challenges since each client can experience group-specific concept drift independently while still sharing the same underlying concept, creating a complex and dynamic environment for maintaining fairness. One of the significant contributions of our research is the formalization and introduction of the problem of group-specific concept drift and its distributed counterpart, shedding light on its critical importance in the realm of fairness. In addition, leveraging insights from prior research, we adapt an existing distributed concept drift adaptation algorithm to tackle group-specific distributed concept drift which utilizes a multi-model approach, a local group-specific drift detection mechanism, and continuous clustering of models over time. The findings from our experiments highlight the importance of addressing group-specific concept drift and its distributed counterpart to advance fairness in machine learning.
♻ ☆ Recurrent Drafter for Fast Speculative Decoding in Large Language Models
In this paper, we introduce an improved approach of speculative decoding aimed at enhancing the efficiency of serving large language models. Our method capitalizes on the strengths of two established techniques: the classic two-model speculative decoding approach, and the more recent single-model approach, Medusa. Drawing inspiration from Medusa, our approach adopts a single-model strategy for speculative decoding. However, our method distinguishes itself by employing a single, lightweight draft head with a recurrent dependency design, akin in essence to the small, draft model uses in classic speculative decoding, but without the complexities of the full transformer architecture. And because of the recurrent dependency, we can use beam search to swiftly filter out undesired candidates with the draft head. The outcome is a method that combines the simplicity of single-model design and avoids the need to create a data-dependent tree attention structure only for inference in Medusa. We empirically demonstrate the effectiveness of the proposed method on several popular open source language models, along with a comprehensive analysis of the trade-offs involved in adopting this approach.
comment: 11 pages, 6 figures
♻ ☆ Large Language Model-informed ECG Dual Attention Network for Heart Failure Risk Prediction
Heart failure (HF) poses a significant public health challenge, with a rising global mortality rate. Early detection and prevention of HF could significantly reduce its impact. We introduce a novel methodology for predicting HF risk using 12-lead electrocardiograms (ECGs). We present a novel, lightweight dual-attention ECG network designed to capture complex ECG features essential for early HF risk prediction, despite the notable imbalance between low and high-risk groups. This network incorporates a cross-lead attention module and twelve lead-specific temporal attention modules, focusing on cross-lead interactions and each lead's local dynamics. To further alleviate model overfitting, we leverage a large language model (LLM) with a public ECG-Report dataset for pretraining on an ECG-report alignment task. The network is then fine-tuned for HF risk prediction using two specific cohorts from the UK Biobank study, focusing on patients with hypertension (UKB-HYP) and those who have had a myocardial infarction (UKB-MI).The results reveal that LLM-informed pre-training substantially enhances HF risk prediction in these cohorts. The dual-attention design not only improves interpretability but also predictive accuracy, outperforming existing competitive methods with C-index scores of 0.6349 for UKB-HYP and 0.5805 for UKB-MI. This demonstrates our method's potential in advancing HF risk assessment with clinical complex ECG data.
comment: Under journal revision
♻ ☆ Multi-resolution Time-Series Transformer for Long-term Forecasting
The performance of transformers for time-series forecasting has improved significantly. Recent architectures learn complex temporal patterns by segmenting a time-series into patches and using the patches as tokens. The patch size controls the ability of transformers to learn the temporal patterns at different frequencies: shorter patches are effective for learning localized, high-frequency patterns, whereas mining long-term seasonalities and trends requires longer patches. Inspired by this observation, we propose a novel framework, Multi-resolution Time-Series Transformer (MTST), which consists of a multi-branch architecture for simultaneous modeling of diverse temporal patterns at different resolutions. In contrast to many existing time-series transformers, we employ relative positional encoding, which is better suited for extracting periodic components at different scales. Extensive experiments on several real-world datasets demonstrate the effectiveness of MTST in comparison to state-of-the-art forecasting techniques.
♻ ☆ Residual Denoising Diffusion Models CVPR2024
We propose residual denoising diffusion models (RDDM), a novel dual diffusion process that decouples the traditional single denoising diffusion process into residual diffusion and noise diffusion. This dual diffusion framework expands the denoising-based diffusion models, initially uninterpretable for image restoration, into a unified and interpretable model for both image generation and restoration by introducing residuals. Specifically, our residual diffusion represents directional diffusion from the target image to the degraded input image and explicitly guides the reverse generation process for image restoration, while noise diffusion represents random perturbations in the diffusion process. The residual prioritizes certainty, while the noise emphasizes diversity, enabling RDDM to effectively unify tasks with varying certainty or diversity requirements, such as image generation and restoration. We demonstrate that our sampling process is consistent with that of DDPM and DDIM through coefficient transformation, and propose a partially path-independent generation process to better understand the reverse process. Notably, our RDDM enables a generic UNet, trained with only an L1 loss and a batch size of 1, to compete with state-of-the-art image restoration methods. We provide code and pre-trained models to encourage further exploration, application, and development of our innovative framework (https://github.com/nachifur/RDDM).
comment: Accepted to CVPR2024
♻ ☆ Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech ICASSP 2024
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. To initiate, Dynamic-SUPERB features 55 evaluation instances by combining 33 tasks and 22 datasets. This spans a broad spectrum of dimensions, providing a comprehensive platform for evaluation. Additionally, we propose several approaches to establish benchmark baselines. These include the utilization of speech models, text language models, and the multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together.
comment: To appear in the proceedings of ICASSP 2024
♻ ☆ Combining the Strengths of Dutch Survey and Register Data in a Data Challenge to Predict Fertility (PreFer)
The social sciences have produced an impressive body of research on determinants of fertility outcomes, or whether and when people have children. However, the strength of these determinants and underlying theories are rarely evaluated on their predictive ability on new data. This prevents us from systematically comparing studies, hindering the evaluation and accumulation of knowledge. In this paper, we present two datasets which can be used to study the predictability of fertility outcomes in the Netherlands. One dataset is based on the LISS panel, a longitudinal survey which includes thousands of variables on a wide range of topics, including individual preferences and values. The other is based on the Dutch register data which lacks attitudinal data but includes detailed information about the life courses of millions of Dutch residents. We provide information about the datasets and the samples, and describe the fertility outcome of interest. We also introduce the fertility prediction data challenge PreFer which is based on these datasets and will start in Spring 2024. We outline the ways in which measuring the predictability of fertility outcomes using these datasets and combining their strengths in the data challenge can advance our understanding of fertility behaviour and computational social science. We further provide details for participants on how to take part in the data challenge.
♻ ☆ Sparse Mean Field Load Balancing in Large Localized Queueing Systems
Scalable load balancing algorithms are of great interest in cloud networks and data centers, necessitating the use of tractable techniques to compute optimal load balancing policies for good performance. However, most existing scalable techniques, especially asymptotically scaling methods based on mean field theory, have not been able to model large queueing networks with strong locality. Meanwhile, general multi-agent reinforcement learning techniques can be hard to scale and usually lack a theoretical foundation. In this work, we address this challenge by leveraging recent advances in sparse mean field theory to learn a near-optimal load balancing policy in sparsely connected queueing networks in a tractable manner, which may be preferable to global approaches in terms of wireless communication overhead. Importantly, we obtain a general load balancing framework for a large class of sparse bounded-degree wireless topologies. By formulating a novel mean field control problem in the context of graphs with bounded degree, we reduce the otherwise difficult multi-agent problem to a single-agent problem. Theoretically, the approach is justified by approximation guarantees. Empirically, the proposed methodology performs well on several realistic and scalable wireless network topologies as compared to a number of well-known load balancing heuristics and existing scalable multi-agent reinforcement learning methods.
♻ ☆ Task-Oriented GNNs Training on Large Knowledge Graphs for Accurate and Efficient Modeling ICDE
A Knowledge Graph (KG) is a heterogeneous graph encompassing a diverse range of node and edge types. Heterogeneous Graph Neural Networks (HGNNs) are popular for training machine learning tasks like node classification and link prediction on KGs. However, HGNN methods exhibit excessive complexity influenced by the KG's size, density, and the number of node and edge types. AI practitioners handcraft a subgraph of a KG G relevant to a specific task. We refer to this subgraph as a task-oriented subgraph (TOSG), which contains a subset of task-related node and edge types in G. Training the task using TOSG instead of G alleviates the excessive computation required for a large KG. Crafting the TOSG demands a deep understanding of the KG's structure and the task's objectives. Hence, it is challenging and time-consuming. This paper proposes KG-TOSA, an approach to automate the TOSG extraction for task-oriented HGNN training on a large KG. In KG-TOSA, we define a generic graph pattern that captures the KG's local and global structure relevant to a specific task. We explore different techniques to extract subgraphs matching our graph pattern: namely (i) two techniques sampling around targeted nodes using biased random walk or influence scores, and (ii) a SPARQL-based extraction method leveraging RDF engines' built-in indices. Hence, it achieves negligible preprocessing overhead compared to the sampling techniques. We develop a benchmark of real KGs of large sizes and various tasks for node classification and link prediction. Our experiments show that KG-TOSA helps state-of-the-art HGNN methods reduce training time and memory usage by up to 70% while improving the model performance, e.g., accuracy and inference time.
comment: 12 pages,9 Figures, 3 Tables, ICDE:2024
♻ ☆ An axiomatized PDE model of deep neural networks
Inspired by the relation between deep neural network (DNN) and partial differential equations (PDEs), we study the general form of the PDE models of deep neural networks. To achieve this goal, we formulate DNN as an evolution operator from a simple base model. Based on several reasonable assumptions, we prove that the evolution operator is actually determined by convection-diffusion equation. This convection-diffusion equation model gives mathematical explanation for several effective networks. Moreover, we show that the convection-diffusion model improves the robustness and reduces the Rademacher complexity. Based on the convection-diffusion equation, we design a new training method for ResNets. Experiments validate the performance of the proposed method.
comment: The experiment design in the paper lacks careful thought and may be misleading in demonstrating our contribution
♻ ☆ Fast Nonlinear Two-Time-Scale Stochastic Approximation: Achieving $O(1/k)$ Finite-Sample Complexity
This paper proposes to develop a new variant of the two-time-scale stochastic approximation to find the roots of two coupled nonlinear operators, assuming only noisy samples of these operators can be observed. Our key idea is to leverage the classic Ruppert-Polyak averaging technique to dynamically estimate the operators through their samples. The estimated values of these averaging steps will then be used in the two-time-scale stochastic approximation updates to find the desired solution. Our main theoretical result is to show that under the strongly monotone condition of the underlying nonlinear operators the mean-squared errors of the iterates generated by the proposed method converge to zero at an optimal rate $O(1/k)$, where $k$ is the number of iterations. Our result significantly improves the existing result of two-time-scale stochastic approximation, where the best known finite-time convergence rate is $O(1/k^{2/3})$. We illustrate this result by applying the proposed method to develop new reinforcement learning algorithms with improved performance.
♻ ☆ KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science
In recent years, we have witnessed the growing interest from academia and industry in applying data science technologies to analyze large amounts of data. In this process, a myriad of artifacts (datasets, pipeline scripts, etc.) are created. However, there has been no systematic attempt to holistically collect and exploit all the knowledge and experiences that are implicitly contained in those artifacts. Instead, data scientists recover information and expertise from colleagues or learn via trial and error. Hence, this paper presents a scalable platform, KGLiDS, that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections. Based on this information, KGLiDS enables various downstream applications, such as data discovery and pipeline automation. Our comprehensive evaluation covers use cases in data discovery, data cleaning, transformation, and AutoML. It shows that KGLiDS is significantly faster with a lower memory footprint than the state-of-the-art systems while achieving comparable or better accuracy.
comment: 15 pages, 9 figures
♻ ☆ Interpretability Guarantees with Merlin-Arthur Classifiers AISTATS24
We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.
comment: AISTATS24 Camera-Ready Version, 34 pages total (9 pages main part, 3 pages references, 22 pages appendix), 17 figures, 3 tables
♻ ☆ Multi-conditioned Graph Diffusion for Neural Architecture Search
Neural architecture search automates the design of neural network architectures usually by exploring a large and thus complex architecture search space. To advance the architecture search, we present a graph diffusion-based NAS approach that uses discrete conditional graph diffusion processes to generate high-performing neural network architectures. We then propose a multi-conditioned classifier-free guidance approach applied to graph diffusion networks to jointly impose constraints such as high accuracy and low hardware latency. Unlike the related work, our method is completely differentiable and requires only a single model training. In our evaluations, we show promising results on six standard benchmarks, yielding novel and unique architectures at a fast speed, i.e. less than 0.2 seconds per architecture. Furthermore, we demonstrate the generalisability and efficiency of our method through experiments on ImageNet dataset.
comment: Accepted at Transactions on Machine Learning Research (TMLR)
♻ ☆ End-to-End Reinforcement Learning of Koopman Models for Economic Nonlinear Model Predictive Control
(Economic) nonlinear model predictive control ((e)NMPC) requires dynamic models that are sufficiently accurate and computationally tractable. Data-driven surrogate models for mechanistic models can reduce the computational burden of (e)NMPC; however, such models are typically trained by system identification for maximum prediction accuracy on simulation samples and perform suboptimally in (e)NMPC. We present a method for end-to-end reinforcement learning of Koopman surrogate models for optimal performance as part of (e)NMPC. We apply our method to two applications derived from an established nonlinear continuous stirred-tank reactor model. The controller performance is compared to that of (e)NMPCs utilizing models trained using system identification, and model-free neural network controllers trained using reinforcement learning. We show that the end-to-end trained models outperform those trained using system identification in (e)NMPC, and that, in contrast to the neural network controllers, the (e)NMPC controllers can react to changes in the control setting without retraining.
comment: manuscript (18 pages, 7 figures, 5 tables), supplementary materials (3 pages, 2 tables)
♻ ☆ Model Uncertainty in Evolutionary Optimization and Bayesian Optimization: A Comparative Analysis
Black-box optimization problems, which are common in many real-world applications, require optimization through input-output interactions without access to internal workings. This often leads to significant computational resources being consumed for simulations. Bayesian Optimization (BO) and Surrogate-Assisted Evolutionary Algorithm (SAEA) are two widely used gradient-free optimization techniques employed to address such challenges. Both approaches follow a similar iterative procedure that relies on surrogate models to guide the search process. This paper aims to elucidate the similarities and differences in the utilization of model uncertainty between these two methods, as well as the impact of model inaccuracies on algorithmic performance. A novel model-assisted strategy is introduced, which utilizes unevaluated solutions to generate offspring, leveraging the population-based search capabilities of evolutionary algorithm to enhance the effectiveness of model-assisted optimization. Experimental results demonstrate that the proposed approach outperforms mainstream Bayesian optimization algorithms in terms of accuracy and efficiency.
♻ ☆ An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression ICLR
We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an "agnostic" view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or the target is outside the RKHS. We analyze the cost of overfitting under a Gaussian universality ansatz using recently derived (non-rigorous) risk estimates in terms of the task eigenstructure. Our analysis provides a more refined characterization of benign, tempered and catastrophic overfitting (cf. Mallinar et al. 2022).
comment: This is the ICLR CR version
♻ ☆ Learning to Embed Time Series Patches Independently ICLR 2024
Masked time series modeling has recently gained much attention as a self-supervised representation learning strategy for time series. Inspired by masked image modeling in computer vision, recent works first patchify and partially mask out time series, and then train Transformers to capture the dependencies between patches by predicting masked patches from unmasked patches. However, we argue that capturing such patch dependencies might not be an optimal strategy for time series representation learning; rather, learning to embed patches independently results in better time series representations. Specifically, we propose to use 1) the simple patch reconstruction task, which autoencode each patch without looking at other patches, and 2) the simple patch-wise MLP that embeds each patch independently. In addition, we introduce complementary contrastive learning to hierarchically capture adjacent time series information efficiently. Our proposed method improves time series forecasting and classification performance compared to state-of-the-art Transformer-based models, while it is more efficient in terms of the number of parameters and training/inference time. Code is available at this repository: https://github.com/seunghan96/pits.
comment: ICLR 2024
♻ ☆ Soft Contrastive Learning for Time Series ICLR 2024
Contrastive learning has shown to be effective to learn representations from time series in a self-supervised way. However, contrasting similar time series instances or values from adjacent timestamps within a time series leads to ignore their inherent correlations, which results in deteriorating the quality of learned representations. To address this issue, we propose SoftCLT, a simple yet effective soft contrastive learning strategy for time series. This is achieved by introducing instance-wise and temporal contrastive loss with soft assignments ranging from zero to one. Specifically, we define soft assignments for 1) instance-wise contrastive loss by the distance between time series on the data space, and 2) temporal contrastive loss by the difference of timestamps. SoftCLT is a plug-and-play method for time series contrastive learning that improves the quality of learned representations without bells and whistles. In experiments, we demonstrate that SoftCLT consistently improves the performance in various downstream tasks including classification, semi-supervised learning, transfer learning, and anomaly detection, showing state-of-the-art performance. Code is available at this repository: https://github.com/seunghan96/softclt.
comment: ICLR 2024 Spotlight
♻ ☆ CPA-Enhancer: Chain-of-Thought Prompted Adaptive Enhancer for Object Detection under Unknown Degradations
Object detection methods under known single degradations have been extensively investigated. However, existing approaches require prior knowledge of the degradation type and train a separate model for each, limiting their practical applications in unpredictable environments. To address this challenge, we propose a chain-of-thought (CoT) prompted adaptive enhancer, CPA-Enhancer, for object detection under unknown degradations. Specifically, CPA-Enhancer progressively adapts its enhancement strategy under the step-by-step guidance of CoT prompts, that encode degradation-related information. To the best of our knowledge, it's the first work that exploits CoT prompting for object detection tasks. Overall, CPA-Enhancer is a plug-and-play enhancement model that can be integrated into any generic detectors to achieve substantial gains on degraded images, without knowing the degradation type priorly. Experimental results demonstrate that CPA-Enhancer not only sets the new state of the art for object detection but also boosts the performance of other downstream vision tasks under unknown degradations.
♻ ☆ Accurately Predicting Probabilities of Safety-Critical Rare Events for Intelligent Systems
Intelligent systems are increasingly integral to our daily lives, yet rare safety-critical events present significant latent threats to their practical deployment. Addressing this challenge hinges on accurately predicting the probability of safety-critical events occurring within a given time step from the current state, a metric we define as 'criticality'. The complexity of predicting criticality arises from the extreme data imbalance caused by rare events in high dimensional variables associated with the rare events, a challenge we refer to as the curse of rarity. Existing methods tend to be either overly conservative or prone to overlooking safety-critical events, thus struggling to achieve both high precision and recall rates, which severely limits their applicability. This study endeavors to develop a criticality prediction model that excels in both precision and recall rates for evaluating the criticality of safety-critical autonomous systems. We propose a multi-stage learning framework designed to progressively densify the dataset, mitigating the curse of rarity across stages. To validate our approach, we evaluate it in two cases: lunar lander and bipedal walker scenarios. The results demonstrate that our method surpasses traditional approaches, providing a more accurate and dependable assessment of criticality in intelligent systems.
♻ ☆ EC-NAS: Energy Consumption Aware Tabular Benchmarks for Neural Architecture Search ICASSP-2024
Energy consumption from the selection, training, and deployment of deep learning models has seen a significant uptick recently. This work aims to facilitate the design of energy-efficient deep learning models that require less computational resources and prioritize environmental sustainability by focusing on the energy consumption. Neural architecture search (NAS) benefits from tabular benchmarks, which evaluate NAS strategies cost-effectively through precomputed performance statistics. We advocate for including energy efficiency as an additional performance criterion in NAS. To this end, we introduce an enhanced tabular benchmark encompassing data on energy consumption for varied architectures. The benchmark, designated as EC-NAS, has been made available in an open-source format to advance research in energy-conscious NAS. EC-NAS incorporates a surrogate model to predict energy consumption, aiding in diminishing the energy expenditure of the dataset creation. Our findings emphasize the potential of EC-NAS by leveraging multi-objective optimization algorithms, revealing a balance between energy usage and accuracy. This suggests the feasibility of identifying energy-lean architectures with little or no compromise in performance.
comment: Accepted to be presented at the International Conference on Acoustics, Speech and Signal Processing (ICASSP-2024). Source code at https://github.com/saintslab/EC-NAS-Bench
♻ ☆ LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers
While polyhedral compilers have shown success in implementing advanced code transformations, they still have challenges in selecting the most profitable transformations that lead to the best speedups. This has motivated the use of machine learning to build cost models to guide the search for polyhedral optimizations. State-of-the-art polyhedral compilers have demonstrated a viable proof-of-concept of this approach. While such a proof-of-concept has shown promise, it still has significant limitations. State-of-the-art polyhedral compilers that use a deep-learning cost model only support a small subset of affine transformations, limiting their ability to apply complex code transformations. They also only support simple programs that have a single loop nest and a rectangular iteration domain, limiting their applicability to many programs. These limitations significantly impact the generality of such compilers and autoschedulers and put into question the whole approach. In this paper, we introduce LOOPer, the first polyhedral autoscheduler that uses a deep-learning based cost model and covers a large set of affine transformations and programs. It supports the exploration of a large set of affine transformations, allowing the application of complex sequences of polyhedral transformations. It also supports the optimization of programs with multiple loop nests and with rectangular and non-rectangular iteration domains, allowing the optimization of an extensive set of programs. We implement and evaluate LOOPer and show that it achieves speedups over the state-of-the-art. On the Polybench benchmark, LOOPer achieves a geometric mean speedup of 1.59x over Tiramisu. LOOPer also achieves competitive speedups with a geometric mean speedup of 1.34x over Pluto, a state-of-the-art polyhedral compiler that does not use a machine-learning based cost model.
♻ ☆ Training Fully Connected Neural Networks is $\exists\mathbb{R}$-Complete
We consider the problem of finding weights and biases for a two-layer fully connected neural network to fit a given set of data points as well as possible, also known as EmpiricalRiskMinimization. Our main result is that the associated decision problem is $\exists\mathbb{R}$-complete, that is, polynomial-time equivalent to determining whether a multivariate polynomial with integer coefficients has any real roots. Furthermore, we prove that algebraic numbers of arbitrarily large degree are required as weights to be able to train some instances to optimality, even if all data points are rational. Our result already applies to fully connected instances with two inputs, two outputs, and one hidden layer of ReLU neurons. Thereby, we strengthen a result by Abrahamsen, Kleist and Miltzow [NeurIPS 2021]. A consequence of this is that a combinatorial search algorithm like the one by Arora, Basu, Mianjy and Mukherjee [ICLR 2018] is impossible for networks with more than one output dimension, unless $\mathsf{NP}=\exists\mathbb{R}$.
comment: 39 pages, 17 figures. Changes in version 2: Added algebraic universality result, improved interpretation of results Changes in version 3: Improved exposition by formalizing properties of gadgets
♻ ☆ Similarity-based Label Inference Attack against Training and Inference of Split Learning
Split learning is a promising paradigm for privacy-preserving distributed learning. The learning model can be cut into multiple portions to be collaboratively trained at the participants by exchanging only the intermediate results at the cut layer. Understanding the security performance of split learning is critical for many privacy-sensitive applications. This paper shows that the exchanged intermediate results, including the smashed data (i.e., extracted features from the raw data) and gradients during training and inference of split learning, can already reveal the private labels. We mathematically analyze the potential label leakages and propose the cosine and Euclidean similarity measurements for gradients and smashed data, respectively. Then, the two similarity measurements are shown to be unified in Euclidean space. Based on the similarity metric, we design three label inference attacks to efficiently recover the private labels during both the training and inference phases. Experimental results validate that the proposed approaches can achieve close to 100% accuracy of label attacks. The proposed attack can still achieve accurate predictions against various state-of-the-art defense mechanisms, including DP-SGD, label differential privacy, gradient compression, and Marvell.
♻ ☆ Multivariate Scenario Generation of Day-Ahead Electricity Prices using Normalizing Flows
Trading on the day-ahead electricity markets requires accurate information about the realization of electricity prices and the uncertainty attached to the predictions. Deriving accurate forecasting models presents a difficult task due to the day-ahead price's non-stationarity resulting from changing market conditions, e.g., due to changes resulting from the energy crisis in 2021. We present a probabilistic forecasting approach for day-ahead electricity prices using the fully data-driven deep generative model called normalizing flow. Our modeling approach generates full-day scenarios of day-ahead electricity prices based on conditional features such as residual load forecasts. Furthermore, we propose extended feature sets of prior realizations and a periodic retraining scheme that allows the normalizing flow to adapt to the changing conditions of modern electricity markets. Our results highlight that the normalizing flow generates high-quality scenarios that reproduce the true price distribution and yield accurate forecasts. Additionally, our analysis highlights how our improvements towards adaptations in changing regimes allow the normalizing flow to adapt to changing market conditions and enable continued sampling of high-quality day-ahead price scenarios.
comment: 17 pages, 9 figures
♻ ☆ Cross-domain Random Pre-training with Prototypes for Reinforcement Learning
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Unsupervised cross-domain Reinforcement Learning (RL) pre-training shows great potential for challenging continuous visual control but poses a big challenge. In this paper, we propose \textbf{C}ross-domain \textbf{R}andom \textbf{P}re-\textbf{T}raining with \textbf{pro}totypes (CRPTpro), a novel, efficient, and effective self-supervised cross-domain RL pre-training framework. CRPTpro decouples data sampling from encoder pre-training, proposing decoupled random collection to easily and quickly generate a qualified cross-domain pre-training dataset. Moreover, a novel prototypical self-supervised algorithm is proposed to pre-train an effective visual encoder that is generic across different domains. Without finetuning, the cross-domain encoder can be implemented for challenging downstream tasks defined in different domains, either seen or unseen. Compared with recent advanced methods, CRPTpro achieves better performance on downstream policy learning without extra training on exploration agents for data collection, greatly reducing the burden of pre-training. We conduct extensive experiments across eight challenging continuous visual-control domains, including balance control, robot locomotion, and manipulation. CRPTpro significantly outperforms the next best Proto-RL(C) on 11/12 cross-domain downstream tasks with only 54\% wall-clock pre-training time, exhibiting state-of-the-art pre-training performance with greatly improved pre-training efficiency. The complete code is available at https://github.com/liuxin0824/CRPTpro.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Forward Learning with Top-Down Feedback: Empirical and Analytical Characterization
"Forward-only" algorithms, which train neural networks while avoiding a backward pass, have recently gained attention as a way of solving the biologically unrealistic aspects of backpropagation. Here, we first address compelling challenges related to the "forward-only" rules, which include reducing the performance gap with backpropagation and providing an analytical understanding of their dynamics. To this end, we show that the forward-only algorithm with top-down feedback is well-approximated by an "adaptive-feedback-alignment" algorithm, and we analytically track its performance during learning in a prototype high-dimensional setting. Then, we compare different versions of forward-only algorithms, focusing on the Forward-Forward and PEPITA frameworks, and we show that they share the same learning principles. Overall, our work unveils the connections between three key neuro-inspired learning rules, providing a link between "forward-only" algorithms, i.e., Forward-Forward and PEPITA, and an approximation of backpropagation, i.e., Feedback Alignment.
♻ ☆ Quantum Langevin Dynamics for Optimization
We initiate the study of utilizing Quantum Langevin Dynamics (QLD) to solve optimization problems, particularly those non-convex objective functions that present substantial obstacles for traditional gradient descent algorithms. Specifically, we examine the dynamics of a system coupled with an infinite heat bath. This interaction induces both random quantum noise and a deterministic damping effect to the system, which nudge the system towards a steady state that hovers near the global minimum of objective functions. We theoretically prove the convergence of QLD in convex landscapes, demonstrating that the average energy of the system can approach zero in the low temperature limit with an exponential decay rate correlated with the evolution time. Numerically, we first show the energy dissipation capability of QLD by retracing its origins to spontaneous emission. Furthermore, we conduct detailed discussion of the impact of each parameter. Finally, based on the observations when comparing QLD with classical Fokker-Plank-Smoluchowski equation, we propose a time-dependent QLD by making temperature and $\hbar$ time-dependent parameters, which can be theoretically proven to converge better than the time-independent case and also outperforms a series of state-of-the-art quantum and classical optimization algorithms in many non-convex landscapes.
comment: 52 pages, 1 table, 25 figures
♻ ☆ E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity
Traditional pruning methods are known to be challenging to work in Large Language Models (LLMs) for Generative AI because of their unaffordable training process and large computational demands. For the first time, we introduce the information entropy of hidden state features into a pruning metric design, namely E-Sparse, to improve the accuracy of N:M sparsity on LLM. E-Sparse employs the information richness to leverage the channel importance, and further incorporates several novel techniques to put it into effect: (1) it introduces information entropy to enhance the significance of parameter weights and input feature norms as a novel pruning metric, and performs N:M sparsity without modifying the remaining weights. (2) it designs global naive shuffle and local block shuffle to quickly optimize the information distribution and adequately cope with the impact of N:M sparsity on LLMs' accuracy. E-Sparse is implemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere GPUs. Extensive experiments on the LLaMA family and OPT models show that E-Sparse can significantly speed up the model inference over the dense model (up to 1.53X) and obtain significant memory saving (up to 43.52%), with acceptable accuracy loss.
♻ ☆ ToonAging: Face Re-Aging upon Artistic Portrait Style Transfer
Face re-aging is a prominent field in computer vision and graphics, with significant applications in photorealistic domains such as movies, advertising, and live streaming. Recently, the need to apply face re-aging to non-photorealistic images, like comics, illustrations, and animations, has emerged as an extension in various entertainment sectors. However, the lack of a network that can seamlessly edit the apparent age in NPR images has limited these tasks to a naive, sequential approach. This often results in unpleasant artifacts and a loss of facial attributes due to domain discrepancies. In this paper, we introduce a novel one-stage method for face re-aging combined with portrait style transfer, executed in a single generative step. We leverage existing face re-aging and style transfer networks, both trained within the same PR domain. Our method uniquely fuses distinct latent vectors, each responsible for managing aging-related attributes and NPR appearance. By adopting an exemplar-based approach, our method offers greater flexibility compared to domain-level fine-tuning approaches, which typically require separate training or fine-tuning for each domain. This effectively addresses the limitation of requiring paired datasets for re-aging and domain-level, data-driven approaches for stylization. Our experiments show that our model can effortlessly generate re-aged images while simultaneously transferring the style of examples, maintaining both natural appearance and controllability.
comment: 14 pages, 15 figures, 1 table
♻ ☆ BBE-LSWCM: A Bootstrapped Ensemble of Long and Short Window Clickstream Models
We consider the problem of developing a clickstream modeling framework for real-time customer event prediction problems in SaaS products like QBO. We develop a low-latency, cost-effective, and robust ensemble architecture (BBE-LSWCM), which combines both aggregated user behavior data from a longer historical window (e.g., over the last few weeks) as well as user activities over a short window in recent-past (e.g., in the current session). As compared to other baseline approaches, we demonstrate the superior performance of the proposed method for two important real-time event prediction problems: subscription cancellation and intended task detection for QBO subscribers. Finally, we present details of the live deployment and results from online experiments in QBO.
comment: 9 pages
♻ ☆ ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Self-attention is an essential component of large language models(LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLMs serving scenarios, the compute and memory operation cost of self-attention can be optimized by using the probability that multiple LLM requests have shared system prompts in prefixes. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into the auxiliary prefix tree. Consequently, on top of the prefix-tree based KV cache, we design an efficient self-attention kernel, where a two-phase partition algorithm is implemented to improve the data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8$\times$ compared to the start-of-the-art implementation, with the length of the system prompt ranging from 1024 to 4096.
♻ ☆ A Benchmark Study on Calibration ICLR 2024
Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through the use of specific loss functions, data preprocessing and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different datasets? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS. The project page can be found at https://www.taolinwei.com/calibration-study
comment: ICLR 2024 poster
♻ ☆ A Temporally Disentangled Contrastive Diffusion Model for Spatiotemporal Imputation
Spatiotemporal data analysis is pivotal across various domains, such as transportation, meteorology, and healthcare. The data collected in real-world scenarios are often incomplete due to device malfunctions and network errors. Spatiotemporal imputation aims to predict missing values by exploiting the spatial and temporal dependencies in the observed data. Traditional imputation approaches based on statistical and machine learning techniques require the data to conform to their distributional assumptions, while graph and recurrent neural networks are prone to error accumulation problems due to their recurrent structures. Generative models, especially diffusion models, can potentially circumvent the reliance on inaccurate, previously imputed values for future predictions; However, diffusion models still face challenges in generating stable results. We propose to address these challenges by designing conditional information to guide the generative process and expedite the training process. We introduce a conditional diffusion framework called C$^2$TSD, which incorporates disentangled temporal (trend and seasonality) representations as conditional information and employs contrastive learning to improve generalizability. Our extensive experiments on three real-world datasets demonstrate the superior performance of our approach compared to a number of state-of-the-art baselines.
♻ ☆ Generalisable Agents for Neural Network Optimisation NeurIPS 2023
Optimising deep neural networks is a challenging task due to complex training dynamics, high computational requirements, and long training times. To address this difficulty, we propose the framework of Generalisable Agents for Neural Network Optimisation (GANNO) -- a multi-agent reinforcement learning (MARL) approach that learns to improve neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO utilises an agent per layer that observes localised network dynamics and accordingly takes actions to adjust these dynamics at a layerwise level to collectively improve global performance. In this paper, we use GANNO to control the layerwise learning rate and show that the framework can yield useful and responsive schedules that are competitive with handcrafted heuristics. Furthermore, GANNO is shown to perform robustly across a wide variety of unseen initial conditions, and can successfully generalise to harder problems than it was trained on. Our work presents an overview of the opportunities that this paradigm offers for training neural networks, along with key challenges that remain to be overcome.
comment: Accepted at the Workshop on Advanced Neural Network Training (WANT) and Optimization for Machine Learning (OPT) at NeurIPS 2023
♻ ☆ Learning to Importance Sample in Primary Sample Space
Importance sampling is one of the most widely used variance reduction strategies in Monte Carlo rendering. In this paper, we propose a novel importance sampling technique that uses a neural network to learn how to sample from a desired density represented by a set of samples. Our approach considers an existing Monte Carlo rendering algorithm as a black box. During a scene-dependent training phase, we learn to generate samples with a desired density in the primary sample space of the rendering algorithm using maximum likelihood estimation. We leverage a recent neural network architecture that was designed to represent real-valued non-volume preserving ('Real NVP') transformations in high dimensional spaces. We use Real NVP to non-linearly warp primary sample space and obtain desired densities. In addition, Real NVP efficiently computes the determinant of the Jacobian of the warp, which is required to implement the change of integration variables implied by the warp. A main advantage of our approach is that it is agnostic of underlying light transport effects, and can be combined with many existing rendering techniques by treating them as a black box. We show that our approach leads to effective variance reduction in several practical scenarios.
comment: 11 pages, 14 figure; authors' version, the definitive version of record is available at https://onlinelibrary.wiley.com/doi/10.1111/cgf.13628
♻ ☆ Statistical Agnostic Regression: a machine learning method to validate regression models
Regression analysis is a central topic in statistical modeling, aiming to estimate the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in several fields of research, such as prediction, forecasting, or causal inference. Beyond various classical methods to solve linear regression problems, such as Ordinary Least Squares, Ridge, or Lasso regressions - which are often the foundation for more advanced machine learning (ML) techniques - the latter have been successfully applied in this scenario without a formal definition of statistical significance. At most, permutation or classical analyses based on empirical measures (e.g., residuals or accuracy) have been conducted to reflect the greater ability of ML estimations for detection. In this paper, we introduce a method, named Statistical Agnostic Regression (SAR), for evaluating the statistical significance of an ML-based linear regression based on concentration inequalities of the actual risk using the analysis of the worst case. To achieve this goal, similar to the classification problem, we define a threshold to establish that there is sufficient evidence with a probability of at least 1-eta to conclude that there is a linear relationship in the population between the explanatory (feature) and the response (label) variables. Simulations in only two dimensions demonstrate the ability of the proposed agnostic test to provide a similar analysis of variance given by the classical $F$ test for the slope parameter.
comment: 17 pages, 15 figures
♻ ☆ mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
comment: Working in Process
♻ ☆ ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection
Out-of-distribution (OOD) detection methods often exploit auxiliary outliers to train model identifying OOD samples, especially discovering challenging outliers from auxiliary outliers dataset to improve OOD detection. However, they may still face limitations in effectively distinguishing between the most challenging OOD samples that are much like in-distribution (ID) data, i.e., \idlike samples. To this end, we propose a novel OOD detection framework that discovers \idlike outliers using CLIP \cite{DBLP:conf/icml/RadfordKHRGASAM21} from the vicinity space of the ID samples, thus helping to identify these most challenging OOD samples. Then a prompt learning framework is proposed that utilizes the identified \idlike outliers to further leverage the capabilities of CLIP for OOD detection. Benefiting from the powerful CLIP, we only need a small number of ID samples to learn the prompts of the model without exposing other auxiliary outlier datasets. By focusing on the most challenging \idlike OOD samples and elegantly exploiting the capabilities of CLIP, our method achieves superior few-shot learning performance on various real-world image datasets (e.g., in 4-shot OOD detection on the ImageNet-1k dataset, our method reduces the average FPR95 by 12.16\% and improves the average AUROC by 2.76\%, compared to state-of-the-art methods). Code is available at https://github.com/ycfate/ID-like.
♻ ☆ LSKNet: A Foundation Lightweight Backbone for Remote Sensing
Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at https://github.com/zcablii/LSKNet.
comment: arXiv admin note: substantial text overlap with arXiv:2303.09030
♻ ☆ Decision-making with Speculative Opponent Models
Opponent modelling has proven effective in enhancing the decision-making of the controlled agent by constructing models of opponent agents. However, existing methods often rely on access to the observations and actions of opponents, a requirement that is infeasible when such information is either unobservable or challenging to obtain. To address this issue, we introduce Distributional Opponent-aided Multi-agent Actor-Critic (DOMAC), the first speculative opponent modelling algorithm that relies solely on local information (i.e., the controlled agent's observations, actions, and rewards). Specifically, the actor maintains a speculated belief about the opponents using the tailored speculative opponent models that predict the opponents' actions using only local information. Moreover, DOMAC features distributional critic models that estimate the return distribution of the actor's policy, yielding a more fine-grained assessment of the actor's quality. This thus more effectively guides the training of the speculative opponent models that the actor depends upon. Furthermore, we formally derive a policy gradient theorem with the proposed opponent models. Extensive experiments under eight different challenging multi-agent benchmark tasks within the MPE, Pommerman and StarCraft Multiagent Challenge (SMAC) demonstrate that our DOMAC successfully models opponents' behaviours and delivers superior performance against state-of-the-art methods with a faster convergence speed.
comment: 13 pages, 27 figures
♻ ☆ Beyond Quantities: Machine Learning-based Characterization of Inequality in Infrastructure Quality Provision in Cities
The objective of this study is to characterize inequality in infrastructure quality across urban areas. While a growing of body of literature has recognized the importance of characterizing infrastructure inequality in cities and provided quantified metrics to inform urban development plans, the majority of the existing approaches focus primarily on measuring the quantity of infrastructure, assuming that more infrastructure is better. Also, the existing research focuses primarily on index-based approaches in which the status of infrastructure provision in urban areas is determined based on assumed subjective weights. The focus on infrastructure quantity and use of indices obtained from subjective weights has hindered the ability to properly examine infrastructure inequality as it pertains to urban inequality and environmental justice considerations. Recognizing this gap, we propose a machine learning-based approach in which infrastructure features that shape environmental hazard exposure are identified and we use the weights obtained by the model to calculate an infrastructure quality provision for spatial areas of cities and accordingly, quantify the extent of inequality in infrastructure quality. The implementation of the model in five metropolitan areas in the U.S. demonstrates the capability of the proposed approach in characterizing inequality in infrastructure quality and capturing city-specific differences in the weights of infrastructure features. The results also show that areas in which low-income populations reside have lower infrastructure quality provision, suggesting the lower infrastructure quality provision as a determinant of urban disparities. Accordingly, the proposed approach can be effectively used to inform integrated urban design strategies to promote infrastructure equity and environmental justice based on data-driven and machine intelligence-based insights.
♻ ☆ Symmetry Breaking and Equivariant Neural Networks
Using symmetry as an inductive bias in deep learning has been proven to be a principled approach for sample-efficient model design. However, the relationship between symmetry and the imperative for equivariance in neural networks is not always obvious. Here, we analyze a key limitation that arises in equivariant functions: their incapacity to break symmetry at the level of individual data samples. In response, we introduce a novel notion of 'relaxed equivariance' that circumvents this limitation. We further demonstrate how to incorporate this relaxation into equivariant multilayer perceptrons (E-MLPs), offering an alternative to the noise-injection method. The relevance of symmetry breaking is then discussed in various application domains: physics, graph representation learning, combinatorial optimization and equivariant decoding.
comment: 14 pages, 2 figures, Symmetry and Geometry in Neural Representations
♻ ☆ Physics-Enhanced Multi-fidelity Learning for Optical Surface Imprint
Human fingerprints serve as one unique and powerful characteristic for each person, from which policemen can recognize the identity. Similar to humans, many natural bodies and intrinsic mechanical qualities can also be uniquely identified from surface characteristics. To measure the elasto-plastic properties of one material, one formally sharp indenter is pushed into the measured body under constant force and retracted, leaving a unique residual imprint of the minute size from several micrometers to nanometers. However, one great challenge is how to map the optical image of this residual imprint into the real wanted mechanical properties, \ie, the tensile force curve. In this paper, we propose a novel method to use multi-fidelity neural networks (MFNN) to solve this inverse problem. We first build up the NN model via pure simulation data, and then bridge the sim-to-real gap via transfer learning. Considering the difficulty of collecting real experimental data, we use NN to dig out the unknown physics and also implant the known physics into the transfer learning framework, thus highly improving the model stability and decreasing the data requirement. The final constructed model only needs three-shot calibration of real materials. We tested the final model across 20 real materials and achieved satisfying accuracy. This work serves as one great example of applying machine learning into scientific research, especially under the constraints of data limitation and fidelity variance.
comment: 15 pages, 11 figure
♻ ☆ Stronger Graph Transformer with Regularized Attention Scores
Graph Neural Networks are notorious for its memory consumption. A recent Transformer-based GNN called Graph Transformer is shown to obtain superior performances when long range dependencies exist. However, combining graph data and Transformer architecture led to a combinationally worse memory issue. We propose a novel version of "edge regularization technique" that alleviates the need for Positional Encoding and ultimately alleviate GT's out of memory issue. We observe that it is not clear whether having an edge regularization on top of positional encoding is helpful. However, it seems evident that applying our edge regularization technique indeed stably improves GT's performance compared to GT without Positional Encoding.
♻ ☆ Enhancing Worker Recruitment in Collaborative Mobile Crowdsourcing: A Graph Neural Network Trust Evaluation Approach
Collaborative Mobile Crowdsourcing (CMCS) allows platforms to recruit worker teams to collaboratively execute complex sensing tasks. The efficiency of such collaborations could be influenced by trust relationships among workers. To obtain the asymmetric trust values among all workers in the social network, the Trust Reinforcement Evaluation Framework (TREF) based on Graph Convolutional Neural Networks (GCNs) is proposed in this paper. The task completion effect is comprehensively calculated by considering the workers' ability benefits, distance benefits, and trust benefits in this paper. The worker recruitment problem is modeled as an Undirected Complete Recruitment Graph (UCRG), for which a specific Tabu Search Recruitment (TSR) algorithm solution is proposed. An optimal execution team is recruited for each task by the TSR algorithm, and the collaboration team for the task is obtained under the constraint of privacy loss. To enhance the efficiency of the recruitment algorithm on a large scale and scope, the Mini-Batch K-Means clustering algorithm and edge computing technology are introduced, enabling distributed worker recruitment. Lastly, extensive experiments conducted on five real datasets validate that the recruitment algorithm proposed in this paper outperforms other baselines. Additionally, TREF proposed herein surpasses the performance of state-of-the-art trust evaluation methods in the literature.
comment: The article has been accepted by IEEE TMC, and its DOI is 10.1109/TMC.2024.3373469
♻ ☆ UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification
Vision-Language Pre-training (VLP) that utilizes the multi-modal information to promote the training efficiency and effectiveness, has achieved great success in vision recognition of natural domains and shown promise in medical imaging diagnosis for the Chest X-Rays (CXRs). However, current works mainly pay attention to the exploration on single dataset of CXRs, which locks the potential of this powerful paradigm on larger hybrid of multi-source CXRs datasets. We identify that although blending samples from the diverse sources offers the advantages to improve the model generalization, it is still challenging to maintain the consistent superiority for the task of each source due to the existing heterogeneity among sources. To handle this dilemma, we design a Conquer-and-Divide pre-training framework, termed as UniChest, aiming to make full use of the collaboration benefit of multiple sources of CXRs while reducing the negative influence of the source heterogeneity. Specially, the ``Conquer" stage in UniChest encourages the model to sufficiently capture multi-source common patterns, and the ``Divide" stage helps squeeze personalized patterns into different small experts (query networks). We conduct thorough experiments on many benchmarks, e.g., ChestX-ray14, CheXpert, Vindr-CXR, Shenzhen, Open-I and SIIM-ACR Pneumothorax, verifying the effectiveness of UniChest over a range of baselines, and release our codes and pre-training models at https://github.com/Elfenreigen/UniChest.
comment: Accepted at IEEE Transactions on Medical Imaging
♻ ☆ Few-shot Adaption to Distribution Shifts By Mixing Source and Target Embeddings
Pretrained machine learning models need to be adapted to distribution shifts when deployed in new target environments. When obtaining labeled data from the target distribution is expensive, few-shot adaptation with only a few examples from the target distribution becomes essential. In this work, we propose MixPro, a lightweight and highly data-efficient approach for few-shot adaptation. MixPro first generates a relatively large dataset by mixing (linearly combining) pre-trained embeddings of large source data with those of the few target examples. This process preserves important features of both source and target distributions, while mitigating the specific noise in the small target data. Then, it trains a linear classifier on the mixed embeddings to effectively adapts the model to the target distribution without overfitting the small target data. Theoretically, we demonstrate the advantages of MixPro over previous methods. Our experiments, conducted across various model architectures on 8 datasets featuring different types of distribution shifts, reveal that MixPro can outperform baselines by up to 7\%, with only 2-4 target examples.
♻ ☆ Scalable Optimal Transport Methods in Machine Learning: A Contemporary Survey
Optimal Transport (OT) is a mathematical framework that first emerged in the eighteenth century and has led to a plethora of methods for answering many theoretical and applied questions. The last decade has been a witness to the remarkable contributions of this classical optimization problem to machine learning. This paper is about where and how optimal transport is used in machine learning with a focus on the question of scalable optimal transport. We provide a comprehensive survey of optimal transport while ensuring an accessible presentation as permitted by the nature of the topic and the context. First, we explain the optimal transport background and introduce different flavors (i.e., mathematical formulations), properties, and notable applications. We then address the fundamental question of how to scale optimal transport to cope with the current demands of big and high dimensional data. We conduct a systematic analysis of the methods used in the literature for scaling OT and present the findings in a unified taxonomy. We conclude with presenting some open challenges and discussing potential future research directions. A live repository of related OT research papers is maintained in https://github.com/abdelwahed/OT_for_big_data.git
comment: Accepted @ TPAMI 24
♻ ☆ tLaSDI: Thermodynamics-informed latent space dynamics identification
We propose a latent space dynamics identification method, namely tLaSDI, that embeds the first and second principles of thermodynamics. The latent variables are learned through an autoencoder as a nonlinear dimension reduction model. The latent dynamics are constructed by a neural network-based model that precisely preserves certain structures for the thermodynamic laws through the GENERIC formalism. An abstract error estimate is established, which provides a new loss formulation involving the Jacobian computation of autoencoder. The autoencoder and the latent dynamics are simultaneously trained to minimize the new loss. Computational examples demonstrate the effectiveness of tLaSDI, which exhibits robust generalization ability, even in extrapolation. In addition, an intriguing correlation is empirically observed between a quantity from tLaSDI in the latent space and the behaviors of the full-state solution.
comment: 32 pages, 8 figures
♻ ☆ Local and Global Trend Bayesian Exponential Smoothing Models
This paper describes a family of seasonal and non-seasonal time series models that can be viewed as generalisations of additive and multiplicative exponential smoothing models, to model series that grow faster than linear but slower than exponential. Their development is motivated by fast-growing, volatile time series. In particular, our models have a global trend that can smoothly change from additive to multiplicative, and is combined with a linear local trend. Seasonality when used is multiplicative in our models, and the error is always additive but is heteroscedastic and can grow through a parameter sigma. We leverage state-of-the-art Bayesian fitting techniques to accurately fit these models that are more complex and flexible than standard exponential smoothing models. When applied to the M3 competition data set, our models outperform the best algorithms in the competition as well as other benchmarks, thus achieving to the best of our knowledge the best results of per-series univariate methods on this dataset in the literature. An open-source software package of our method is available.
♻ ☆ Multiscale Hodge Scattering Networks for Data Analysis
We propose new scattering networks for signals measured on simplicial complexes, which we call \emph{Multiscale Hodge Scattering Networks} (MHSNs). Our construction is based on multiscale basis dictionaries on simplicial complexes, i.e., the $\kappa$-GHWT and $\kappa$-HGLET, which we recently developed for simplices of dimension $\kappa \in \mathbb{N}$ in a given simplicial complex by generalizing the node-based Generalized Haar-Walsh Transform (GHWT) and Hierarchical Graph Laplacian Eigen Transform (HGLET). The $\kappa$-GHWT and the $\kappa$-HGLET both form redundant sets (i.e., dictionaries) of multiscale basis vectors and the corresponding expansion coefficients of a given signal. Our MHSNs use a layered structure analogous to a convolutional neural network (CNN) to cascade the moments of the modulus of the dictionary coefficients. The resulting features are invariant to reordering of the simplices (i.e., node permutation of the underlying graphs). Importantly, the use of multiscale basis dictionaries in our MHSNs admits a natural pooling operation that is akin to local pooling in CNNs, and which may be performed either locally or per-scale. These pooling operations are harder to define in both traditional scattering networks based on Morlet wavelets, and geometric scattering networks based on Diffusion Wavelets. As a result, we are able to extract a rich set of descriptive yet robust features that can be used along with very simple machine learning methods (i.e., logistic regression or support vector machines) to achieve high-accuracy classification systems with far fewer parameters to train than most modern graph neural networks. Finally, we demonstrate the usefulness of our MHSNs in three distinct types of problems: signal classification, domain (i.e., graph/simplex) classification, and molecular dynamics prediction.
comment: 20 Pages, Comments Welcome
Cryptography and Security
☆ A Transfer Attack to Image Watermarks
Watermark has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detector against evasion attacks in the white-box and black-box settings is well understood in the literature. However, the robustness in the no-box setting is much less understood. In particular, multiple studies claimed that image watermark is robust in such setting. In this work, we propose a new transfer evasion attack to image watermark in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show that, both theoretically and empirically, watermark-based AI-generated image detector is not robust to evasion attacks even if the attacker does not have access to the watermarking model nor the detection API.
☆ Blockchain-based Pseudonym Management for Vehicle Twin Migrations in Vehicular Edge Metaverse
Driven by the great advances in metaverse and edge computing technologies, vehicular edge metaverses are expected to disrupt the current paradigm of intelligent transportation systems. As highly computerized avatars of Vehicular Metaverse Users (VMUs), the Vehicle Twins (VTs) deployed in edge servers can provide valuable metaverse services to improve driving safety and on-board satisfaction for their VMUs throughout journeys. To maintain uninterrupted metaverse experiences, VTs must be migrated among edge servers following the movements of vehicles. This can raise concerns about privacy breaches during the dynamic communications among vehicular edge metaverses. To address these concerns and safeguard location privacy, pseudonyms as temporary identifiers can be leveraged by both VMUs and VTs to realize anonymous communications in the physical space and virtual spaces. However, existing pseudonym management methods fall short in meeting the extensive pseudonym demands in vehicular edge metaverses, thus dramatically diminishing the performance of privacy preservation. To this end, we present a cross-metaverse empowered dual pseudonym management framework. We utilize cross-chain technology to enhance management efficiency and data security for pseudonyms. Furthermore, we propose a metric to assess the privacy level and employ a Multi-Agent Deep Reinforcement Learning (MADRL) approach to obtain an optimal pseudonym generating strategy. Numerical results demonstrate that our proposed schemes are high-efficiency and cost-effective, showcasing their promising applications in vehicular edge metaverses.
comment: 14 pages, 9 figures
☆ From Hardware Fingerprint to Access Token: Enhancing the Authentication on IoT Devices
The proliferation of consumer IoT products in our daily lives has raised the need for secure device authentication and access control. Unfortunately, these resource-constrained devices typically use token-based authentication, which is vulnerable to token compromise attacks that allow attackers to impersonate the devices and perform malicious operations by stealing the access token. Using hardware fingerprints to secure their authentication is a promising way to mitigate these threats. However, once attackers have stolen some hardware fingerprints (e.g., via MitM attacks), they can bypass the hardware authentication by training a machine learning model to mimic fingerprints or reusing these fingerprints to craft forge requests. In this paper, we present MCU-Token, a secure hardware fingerprinting framework for MCU-based IoT devices even if the cryptographic mechanisms (e.g., private keys) are compromised. MCU-Token can be easily integrated with various IoT devices by simply adding a short hardware fingerprint-based token to the existing payload. To prevent the reuse of this token, we propose a message mapping approach that binds the token to a specific request via generating the hardware fingerprints based on the request payload. To defeat the machine learning attacks, we mix the valid fingerprints with poisoning data so that attackers cannot train a usable model with the leaked tokens. MCU-Token can defend against armored adversary who may replay, craft, and offload the requests via MitM or use both hardware (e.g., use identical devices) and software (e.g., machine learning attacks) strategies to mimic the fingerprints. The system evaluation shows that MCU-Token can achieve high accuracy (over 97%) with a low overhead across various IoT devices and application scenarios.
☆ Attacking with Something That Does Not Exist: Low-Rate Flood with 'Proof of Non-Existence' Can Exhaust DNS Resolver CPU
NSEC3 is a proof of non-existence in DNSSEC, which provides an authenticated assertion that a queried resource does not exist in the target domain. NSEC3 consists of alphabetically sorted hashed names before and after the queried hostname. To make dictionary attacks harder, the hash function can be applied in multiple iterations, which however also increases the load on the DNS resolver during the computation of the SHA-1 hashes in NSEC3 records. Concerns about the load created by the computation of NSEC3 records on the DNS resolvers were already considered in the NSEC3 specifications RFC5155 and RFC9276. In February 2024, the potential of NSEC3 to exhaust DNS resolvers' resources was assigned a CVE-2023-50868, confirming that extra iterations of NSEC3 created substantial load. However, there is no published evaluation of the attack and the impact of the attack on the resolvers was not clarified. In this work we perform the first evaluation of the NSEC3-encloser attack against DNS resolver implementations and find that the NSEC3-encloser attack can still create a 72x increase in CPU instruction count, despite the victim resolver following RFC5155 recommendations in limiting hash iteration counts. The impact of the attack varies across the different DNS resolvers, but we show that with a sufficient volume of DNS packets the attack can increase CPU load and cause packet loss. We find that at a rate of 150 malicious NSEC3 records per second, depending on the DNS implementation, the loss rate of benign DNS requests varies between 2.7% and 30%. We provide a detailed description and implementation the NSEC3-encloser attack along with evaluation against five popular DNS resolver implementations. We also develop the first analysis how each NSEC3 parameter impacts the load inflicted on the victim resolver during NSEC3-encloser attack.
comment: 13 pages, 7 figures
☆ Differentially Private Ad Conversion Measurement
In this work, we study ad conversion measurement, a central functionality in digital advertising, where an advertiser seeks to estimate advertiser website (or mobile app) conversions attributed to ad impressions that users have interacted with on various publisher websites (or mobile apps). Using differential privacy (DP), a notion that has gained in popularity due to its strong mathematical guarantees, we develop a formal framework for private ad conversion measurement. In particular, we define the notion of an operationally valid configuration of the attribution rule, DP adjacency relation, contribution bounding scope and enforcement point. We then provide, for the set of configurations that most commonly arises in practice, a complete characterization, which uncovers a delicate interplay between attribution and privacy.
comment: To appear in PoPETS 2024
☆ VPAS: Publicly Verifiable and Privacy-Preserving Aggregate Statistics on Distributed Datasets
Aggregate statistics play an important role in extracting meaningful insights from distributed data while preserving privacy. A growing number of application domains, such as healthcare, utilize these statistics in advancing research and improving patient care. In this work, we explore the challenge of input validation and public verifiability within privacy-preserving aggregation protocols. We address the scenario in which a party receives data from multiple sources and must verify the validity of the input and correctness of the computations over this data to third parties, such as auditors, while ensuring input data privacy. To achieve this, we propose the "VPAS" protocol, which satisfies these requirements. Our protocol utilizes homomorphic encryption for data privacy, and employs Zero-Knowledge Proofs (ZKP) and a blockchain system for input validation and public verifiability. We constructed VPAS by extending existing verifiable encryption schemes into secure protocols that enable N clients to encrypt, aggregate, and subsequently release the final result to a collector in a verifiable manner. We implemented and experimentally evaluated VPAS with regard to encryption costs, proof generation, and verification. The findings indicate that the overhead associated with verifiability in our protocol is 10x lower than that incurred by simply using conventional zkSNARKs. This enhanced efficiency makes it feasible to apply input validation with public verifiability across a wider range of applications or use cases that can tolerate moderate computational overhead associated with proof generation.
☆ Cryptic Bytes: WebAssembly Obfuscation for Evading Cryptojacking Detection
WebAssembly has gained significant traction as a high-performance, secure, and portable compilation target for the Web and beyond. However, its growing adoption has also introduced new security challenges. One such threat is cryptojacking, where websites mine cryptocurrencies on visitors' devices without their knowledge or consent, often through the use of WebAssembly. While detection methods have been proposed, research on circumventing them remains limited. In this paper, we present the most comprehensive evaluation of code obfuscation techniques for WebAssembly to date, assessing their effectiveness, detectability, and overhead across multiple abstraction levels. We obfuscate a diverse set of applications, including utilities, games, and crypto miners, using state-of-the-art obfuscation tools like Tigress and wasm-mutate, as well as our novel tool, emcc-obf. Our findings suggest that obfuscation can effectively produce dissimilar WebAssembly binaries, with Tigress proving most effective, followed by emcc-obf and wasm-mutate. The impact on the resulting native code is also significant, although the V8 engine's TurboFan optimizer can reduce native code size by 30\% on average. Notably, we find that obfuscation can successfully evade state-of-the-art cryptojacking detectors. Although obfuscation can introduce substantial performance overheads, we demonstrate how obfuscation can be used for evading detection with minimal overhead in real-world scenarios by strategically applying transformations. These insights are valuable for researchers, providing a foundation for developing more robust detection methods. Additionally, we make our dataset of over 20,000 obfuscated WebAssembly binaries and the emcc-obf tool publicly available to stimulate further research.
☆ ECHO: Efficient Off-Chain Payments and Cross-Chain Swaps for Cryptocurrencies
In this paper, we present ECHO, a TEE-based layer-2 solution that tackles two crucial challenges in the realm of cryptocurrencies: off-chain payments and cross-chain swaps. It offers three notable features: - Channel-free off-chain payments: it allows a payer to make direct payments to anyone without requiring any on-chain relationship or intermediary channels. - Real-time yet decentralized cross-chain swaps: it is the first known solution that enables real-time cross-chain swaps without relying on a central server. This novel feature is made possible through a ground-breaking fair exchange protocol. - TEE crash-tolerance: it offers two solutions to handle TEE crashes, one of which involves an innovative application of time-lock puzzles in this context. We evaluate ECHO on a network consists of 1000 nodes and the evaluation results show that ECHO can achieve 7000 TPS
☆ Evaluating the Influence of Multi-Factor Authentication and Recovery Settings on the Security and Accessibility of User Accounts SP 2024
Nowadays, most online services offer different authentication methods that users can set up for multi-factor authentication but also as a recovery method. This configuration must be done thoroughly to prevent an adversary's access while ensuring the legitimate user does not lose access to their account. This is particularly important for fundamental everyday services, where either failure would have severe consequences. Nevertheless, little research has been done on the authentication of actual users regarding security and the risk of being locked out of their accounts. To foster research in this direction, this paper presents a study on the account settings of Google and Apple users. Considering the multi-factor authentication configuration and recovery options, we analyzed the account security and lock-out risks. Our results provide insights into the usage of multi-factor authentication in practice, show significant security differences between Google and Apple accounts, and reveal that many users would miss access to their accounts when losing a single authentication device.
comment: 10 pages, published and presented at ICISSP 2024
☆ Real-time Threat Detection Strategies for Resource-constrained Devices
As more devices connect to the internet, it becomes crucial to address their limitations and basic security needs. While much research focuses on utilizing ML and DL to tackle security challenges, there is often a tendency to overlook the practicality and feasibility of implementing these methods in real-time settings. This oversight stems from the constrained processing power and memory of certain devices (IoT devices), as well as concerns about the generalizability of these approaches. Focusing on the detection of DNS-tunneling attacks in a router as a case study, we present an end-to-end process designed to effectively address these challenges. The process spans from developing a lightweight DNS-tunneling detection model to integrating it into a resource-constrained device for real-time detection. Through our experiments, we demonstrate that utilizing stateless features for training the ML model, along with features chosen to be independent of the network configuration, leads to highly accurate results. The deployment of this carefully crafted model, optimized for embedded devices across diverse environments, resulted in high DNS-tunneling attack detection with minimal latency. With this work, we aim to encourage solutions that strike a balance between theoretical advancements and the practical applicability of ML approaches in the ever-evolving landscape of device security.
☆ A Taxmans guide to taxation of crypto assets
The Financial system has witnessed rapid technological changes. The rise of Bitcoin and other crypto assets based on Distributed Ledger Technology mark a fundamental change in the way people transact and transmit value over a decentralized network, spread across geographies. This has created regulatory and tax policy blind spots, as governments and tax administrations take time to understand and provide policy responses to this innovative, revolutionary, and fast-paced technology. Due to the breakneck speed of innovation in blockchain technology and advent of Decentralized Finance, Decentralized Autonomous Organizations and the Metaverse, it is unlikely that the policy interventions and guidance by regulatory authorities or tax administrations would be ahead or in sync with the pace of innovation. This paper tries to explain the principles on which crypto assets function, their underlying technology and relates them to the tax issues and taxable events which arise within this ecosystem. It also provides instances of tax and regulatory policy responses already in effect in various jurisdictions, including the recent changes in reporting standards by the FATF and the OECD. This paper tries to explain the rationale behind existing laws and policies and the challenges in their implementation. It also attempts to present a ballpark estimate of tax potential of this asset class and suggests creation of global public digital infrastructure that can address issues related to pseudonymity and extra-territoriality. The paper analyses both direct and indirect taxation issues related to crypto assets and discusses more recent aspects like proof-of-stake and maximal extractable value in greater detail.
comment: 105 pages, 59 figures and 4 Tables
☆ The Power of Bamboo: On the Post-Compromise Security for Searchable Symmetric Encryption NDSS 2023
Dynamic searchable symmetric encryption (DSSE) enables users to delegate the keyword search over dynamically updated encrypted databases to an honest-but-curious server without losing keyword privacy. This paper studies a new and practical security risk to DSSE, namely, secret key compromise (e.g., a user's secret key is leaked or stolen), which threatens all the security guarantees offered by existing DSSE schemes. To address this open problem, we introduce the notion of searchable encryption with key-update (SEKU) that provides users with the option of non-interactive key updates. We further define the notion of post-compromise secure with respect to leakage functions to study whether DSSE schemes can still provide data security after the client's secret key is compromised. We demonstrate that post-compromise security is achievable with a proposed protocol called ``Bamboo". Interestingly, the leakage functions of Bamboo satisfy the requirements for both forward and backward security. We conduct a performance evaluation of Bamboo using a real-world dataset and compare its runtime efficiency with the existing forward-and-backward secure DSSE schemes. The result shows that Bamboo provides strong security with better or comparable performance.
comment: This is a full version paper that includes the security proof. The paper with the same name has been published by NDSS 2023
☆ DP-Dueling: Learning from Preference Feedback without Compromising User Privacy
We consider the well-studied dueling bandit problem, where a learner aims to identify near-optimal actions using pairwise comparisons, under the constraint of differential privacy. We consider a general class of utility-based preference matrices for large (potentially unbounded) decision spaces and give the first differentially private dueling bandit algorithm for active learning with user preferences. Our proposed algorithms are computationally efficient with near-optimal performance, both in terms of the private and non-private regret bound. More precisely, we show that when the decision space is of finite size $K$, our proposed algorithm yields order optimal $O\Big(\sum_{i = 2}^K\log\frac{KT}{\Delta_i} + \frac{K}{\epsilon}\Big)$ regret bound for pure $\epsilon$-DP, where $\Delta_i$ denotes the suboptimality gap of the $i$-th arm. We also present a matching lower bound analysis which proves the optimality of our algorithms. Finally, we extend our results to any general decision space in $d$-dimensions with potentially infinite arms and design an $\epsilon$-DP algorithm with regret $\tilde{O} \left( \frac{d^6}{\kappa \epsilon } + \frac{ d\sqrt{T }}{\kappa} \right)$, providing privacy for free when $T \gg d$.
☆ Tie-Breaking Rule Based on Partial Proof of Work in a Blockchain
Numerous methods have been proposed for suppressing intentional forks by attackers in blockchain systems. Among these, last-generated rules, which select the latest chain among chains in a tie, are effective methods that do not require significant changes to the blockchain protocol. However, existing methods either require a trusted third party or rely on timestamps that attackers can manipulate which makes applying a last-generated rule to existing systems such as Bitcoin challenging. To address these issues, we propose a last-generated rule that can be easily applied to existing proof of work blockchain systems. Our method uses partial proof of work, which does not function as a block, as a time standard with finer granularity. Only weak synchronization, which is already met by existing systems, is required for effective functioning. We evaluated the proposed method through a detailed analysis that is lacking in existing works. In networks that adopt our method, the proportion of the attacker hashrate necessary for selfish mining was approximately 0.31479 or higher, regardless of the block propagation capability of the attacker. Furthermore, we demonstrated through extended selfish mining that the impact of Match against pre-generated block, which is a concern in all last-generated rules, can be mitigated with appropriate parameter settings.
☆ Clean-image Backdoor Attacks
To gather a significant quantity of annotated training data for high-performance image classification models, numerous companies opt to enlist third-party providers to label their unlabeled data. This practice is widely regarded as secure, even in cases where some annotated errors occur, as the impact of these minor inaccuracies on the final performance of the models is negligible and existing backdoor attacks require attacker's ability to poison the training images. Nevertheless, in this paper, we propose clean-image backdoor attacks which uncover that backdoors can still be injected via a fraction of incorrect labels without modifying the training images. Specifically, in our attacks, the attacker first seeks a trigger feature to divide the training images into two parts: those with the feature and those without it. Subsequently, the attacker falsifies the labels of the former part to a backdoor class. The backdoor will be finally implanted into the target model after it is trained on the poisoned data. During the inference phase, the attacker can activate the backdoor in two ways: slightly modifying the input image to obtain the trigger feature, or taking an image that naturally has the trigger feature as input. We conduct extensive experiments to demonstrate the effectiveness and practicality of our attacks. According to the experimental results, we conclude that our attacks seriously jeopardize the fairness and robustness of image classification models, and it is necessary to be vigilant about the incorrect labels in outsourced labeling.
☆ FileDES: A Secure Scalable and Succinct Decentralized Encrypted Storage Network INFOCOM
Decentralized Storage Network (DSN) is an emerging technology that challenges traditional cloud-based storage systems by consolidating storage capacities from independent providers and coordinating to provide decentralized storage and retrieval services. However, current DSNs face several challenges associated with data privacy and efficiency of the proof systems. To address these issues, we propose FileDES (\uline{D}ecentralized \uline{E}ncrypted \uline{S}torage), which incorporates three essential elements: privacy preservation, scalable storage proof, and batch verification. FileDES provides encrypted data storage while maintaining data availability, with a scalable Proof of Encrypted Storage (PoES) algorithm that is resilient to Sybil and Generation attacks. Additionally, we introduce a rollup-based batch verification approach to simultaneously verify multiple files using publicly verifiable succinct proofs. We conducted a comparative evaluation on FileDES, Filecoin, Storj and Sia under various conditions, including a WAN composed of up to 120 geographically dispersed nodes. Our protocol outperforms the others in terms of proof generation/verification efficiency, storage costs, and scalability.
comment: 10 pages, 8 figures, 1 table. Accepted by 2024 IEEE INFOCOM
☆ Enabling Physical Localization of Uncooperative Cellular Devices
In cellular networks, it can become necessary for authorities to physically locate user devices for tracking criminals or illegal devices. While cellular operators can provide authorities with cell information the device is camping on, fine-grained localization is still required. Therefore, the authorized agents trace the device by monitoring its uplink signals. However, tracking the uplink signal source without its cooperation is challenging even for operators and authorities. Particularly, three challenges remain for fine-grained localization: i) localization works only if devices generate enough uplink traffic reliably over time, ii) the target device might generate its uplink traffic with significantly low power, and iii) cellular repeater may add too much noise to true uplink signals. While these challenges present practical hurdles for localization, they have been overlooked in prior works. In this work, we investigate the impact of these real-world challenges on cellular localization and propose an Uncooperative Multiangulation Attack (UMA) that addresses these challenges. UMA can 1) force a target device to transmit traffic continuously, 2) boost the target's signal strength to the maximum, and 3) uniquely distinguish traffic from the target and the repeaters. Notably, the UMA technique works without privilege on cellular operators or user devices, which makes it operate on any LTE network. Our evaluations show that UMA effectively resolves the challenges in real-world environments when devices are not cooperative for localization. Our approach exploits the current cellular design vulnerabilities, which we have responsibly disclosed to GSMA.
☆ Snail: Secure Single Iteration Localization
Localization is a computer vision task by which the position and orientation of a camera is determined from an image and environmental map. We propose a method for performing localization in a privacy preserving manner supporting two scenarios: first, when the image and map are held by a client who wishes to offload localization to untrusted third parties, and second, when the image and map are held separately by untrusting parties. Privacy preserving localization is necessary when the image and map are confidential, and offloading conserves on-device power and frees resources for other tasks. To accomplish this we integrate existing localization methods and secure multi-party computation (MPC), specifically garbled circuits, yielding proof-based security guarantees in contrast to existing obfuscation-based approaches which recent related work has shown vulnerable. We present two approaches to localization, a baseline data-oblivious adaptation of localization suitable for garbled circuits and our novel Single Iteration Localization. Our technique improves overall performance while maintaining confidentiality of the input image, map, and output pose at the expense of increased communication rounds but reduced computation and communication required per round. Single Iteration Localization is over two orders of magnitude faster than a straightforward application of garbled circuits to localization enabling real-world usage in the first robot to offload localization without revealing input images, environmental map, position, or orientation to offload servers.
☆ Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation
In this article, we address the problem of federated learning in the presence of stragglers. For this problem, a coded federated learning framework has been proposed, where the central server aggregates gradients received from the non-stragglers and gradient computed from a privacy-preservation global coded dataset to mitigate the negative impact of the stragglers. However, when aggregating these gradients, fixed weights are consistently applied across iterations, neglecting the generation process of the global coded dataset and the dynamic nature of the trained model over iterations. This oversight may result in diminished learning performance. To overcome this drawback, we propose a new method named adaptive coded federated learning (ACFL). In ACFL, before the training, each device uploads a coded local dataset with additive noise to the central server to generate a global coded dataset under privacy preservation requirements. During each iteration of the training, the central server aggregates the gradients received from the non-stragglers and the gradient computed from the global coded dataset, where an adaptive policy for varying the aggregation weights is designed. Under this policy, we optimize the performance in terms of privacy and learning, where the learning performance is analyzed through convergence analysis and the privacy performance is characterized via mutual information differential privacy. Finally, we perform simulations to demonstrate the superiority of ACFL compared with the non-adaptive methods.
☆ Differentially Private Next-Token Prediction of Large Language Models
Ensuring the privacy of Large Language Models (LLMs) is becoming increasingly important. The most widely adopted technique to accomplish this is DP-SGD, which trains a model in such a way that guarantees Differential Privacy (DP). However, DP-SGD requires longer training times and larger memory requirements than SGD, while overestimating an adversary's capabilities in having white box access to the model. A more realistic scenario assumes only black-box access to a privacy-sensitive LLM. Motivated by these observations, we present Private Mixing of Ensemble Distributions (PMixED): a private prediction protocol that achieves practical next-token prediction by projecting each of the model's output distribution from an ensemble of fine-tuned LLMs onto a set around a public LLM's output distribution, then averaging the projected distributions and sampling from it. Our approach is more lightweight than DP-SGD in that it is model agnostic, instead providing differential privacy at prediction rather than during training. Our results show that PMixED achieves a stronger privacy guarantee than sample-level privacy and outperforms DP-SGD for privacy $\epsilon = 8$ on large-scale datasets.
☆ Assessing Web Fingerprinting Risk WWW
Modern Web APIs allow developers to provide extensively customized experiences for website visitors, but the richness of the device information they provide also make them vulnerable to being abused to construct browser fingerprints, device-specific identifiers that enable covert tracking of users even when cookies are disabled. Previous research has established entropy, a measure of information, as the key metric for quantifying fingerprinting risk. However, earlier studies had two major limitations. First, their entropy estimates were based on either a single website or a very small sample of devices. Second, they did not adequately consider correlations among different Web APIs, potentially grossly overestimating their fingerprinting risk. We provide the first study of browser fingerprinting which addresses the limitations of prior work. Our study is based on actual visited pages and Web APIs reported by tens of millions of real Chrome browsers in-the-wild. We accounted for the dependencies and correlations among Web APIs, which is crucial for obtaining more realistic entropy estimates. We also developed a novel experimental design that accurately and efficiently estimates entropy while never observing too much information from any single user. Our results provide an understanding of the distribution of entropy for different website categories, confirm the utility of entropy as a fingerprinting proxy, and offer a method for evaluating browser enhancements which are intended to mitigate fingerprinting.
comment: A version of this report to appear in the proceedings of The Web Conference (WWW) 2024. This version contains additional material in the appendix
☆ Just another copy and paste? Comparing the security vulnerabilities of ChatGPT generated code and StackOverflow answers SP
Sonatype's 2023 report found that 97% of developers and security leads integrate generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), into their development process. Concerns about the security implications of this trend have been raised. Developers are now weighing the benefits and risks of LLMs against other relied-upon information sources, such as StackOverflow (SO), requiring empirical data to inform their choice. In this work, our goal is to raise software developers awareness of the security implications when selecting code snippets by empirically comparing the vulnerabilities of ChatGPT and StackOverflow. To achieve this, we used an existing Java dataset from SO with security-related questions and answers. Then, we asked ChatGPT the same SO questions, gathering the generated code for comparison. After curating the dataset, we analyzed the number and types of Common Weakness Enumeration (CWE) vulnerabilities of 108 snippets from each platform using CodeQL. ChatGPT-generated code contained 248 vulnerabilities compared to the 302 vulnerabilities found in SO snippets, producing 20% fewer vulnerabilities with a statistically significant difference. Additionally, ChatGPT generated 19 types of CWE, fewer than the 22 found in SO. Our findings suggest developers are under-educated on insecure code propagation from both platforms, as we found 274 unique vulnerabilities and 25 types of CWE. Any code copied and pasted, created by AI or humans, cannot be trusted blindly, requiring good software engineering practices to reduce risk. Future work can help minimize insecure code propagation from any platform.
comment: 8 pages, 2 figures, accepted at Deep Learning Security and Privacy Workshop (DLSP) part of IEEE Symposium on Security and Privacy Workshops (SPW) for 2024
☆ Medical Image Data Provenance for Medical Cyber-Physical System
Continuous advancements in medical technology have led to the creation of affordable mobile imaging devices suitable for telemedicine and remote monitoring. However, the rapid examination of large populations poses challenges, including the risk of fraudulent practices by healthcare professionals and social workers exchanging unverified images via mobile applications. To mitigate these risks, this study proposes using watermarking techniques to embed a device fingerprint (DFP) into captured images, ensuring data provenance. The DFP, representing the unique attributes of the capturing device and raw image, is embedded into raw images before storage, thus enabling verification of image authenticity and source. Moreover, a robust remote validation method is introduced to authenticate images, enhancing the integrity of medical image data in interconnected healthcare systems. Through a case study on mobile fundus imaging, the effectiveness of the proposed framework is evaluated in terms of computational efficiency, image quality, security, and trustworthiness. This approach is suitable for a range of applications, including telemedicine, the Internet of Medical Things (IoMT), eHealth, and Medical Cyber-Physical Systems (MCPS) applications, providing a reliable means to maintain data provenance in diagnostic settings utilizing medical images or videos.
☆ Privacy-Preserving End-to-End Spoken Language Understanding IJCAI
Spoken language understanding (SLU), one of the key enabling technologies for human-computer interaction in IoT devices, provides an easy-to-use user interface. Human speech can contain a lot of user-sensitive information, such as gender, identity, and sensitive content. New types of security and privacy breaches have thus emerged. Users do not want to expose their personal sensitive information to malicious attacks by untrusted third parties. Thus, the SLU system needs to ensure that a potential malicious attacker cannot deduce the sensitive attributes of the users, while it should avoid greatly compromising the SLU accuracy. To address the above challenge, this paper proposes a novel SLU multi-task privacy-preserving model to prevent both the speech recognition (ASR) and identity recognition (IR) attacks. The model uses the hidden layer separation technique so that SLU information is distributed only in a specific portion of the hidden layer, and the other two types of information are removed to obtain a privacy-secure hidden layer. In order to achieve good balance between efficiency and privacy, we introduce a new mechanism of model pre-training, namely joint adversarial training, to further enhance the user privacy. Experiments over two SLU datasets show that the proposed method can reduce the accuracy of both the ASR and IR attacks close to that of a random guess, while leaving the SLU performance largely unaffected.
comment: Accepted by IJCAI
☆ Twin Auto-Encoder Model for Learning Separable Representation in Cyberattack Detection
Representation Learning (RL) plays a pivotal role in the success of many problems including cyberattack detection. Most of the RL methods for cyberattack detection are based on the latent vector of Auto-Encoder (AE) models. An AE transforms raw data into a new latent representation that better exposes the underlying characteristics of the input data. Thus, it is very useful for identifying cyberattacks. However, due to the heterogeneity and sophistication of cyberattacks, the representation of AEs is often entangled/mixed resulting in the difficulty for downstream attack detection models. To tackle this problem, we propose a novel mod called Twin Auto-Encoder (TAE). TAE deterministically transforms the latent representation into a more distinguishable representation namely the \textit{separable representation} and the reconstructsuct the separable representation at the output. The output of TAE called the \textit{reconstruction representation} is input to downstream models to detect cyberattacks. We extensively evaluate the effectiveness of TAE using a wide range of bench-marking datasets. Experiment results show the superior accuracy of TAE over state-of-the-art RL models and well-known machine learning algorithms. Moreover, TAE also outperforms state-of-the-art models on some sophisticated and challenging attacks. We then investigate various characteristics of TAE to further demonstrate its superiority.
☆ Multiple-Input Auto-Encoder Guided Feature Selection for IoT Intrusion Detection Systems
While intrusion detection systems (IDSs) benefit from the diversity and generalization of IoT data features, the data diversity (e.g., the heterogeneity and high dimensions of data) also makes it difficult to train effective machine learning models in IoT IDSs. This also leads to potentially redundant/noisy features that may decrease the accuracy of the detection engine in IDSs. This paper first introduces a novel neural network architecture called Multiple-Input Auto-Encoder (MIAE). MIAE consists of multiple sub-encoders that can process inputs from different sources with different characteristics. The MIAE model is trained in an unsupervised learning mode to transform the heterogeneous inputs into lower-dimensional representation, which helps classifiers distinguish between normal behaviour and different types of attacks. To distil and retain more relevant features but remove less important/redundant ones during the training process, we further design and embed a feature selection layer right after the representation layer of MIAE resulting in a new model called MIAEFS. This layer learns the importance of features in the representation vector, facilitating the selection of informative features from the representation vector. The results on three IDS datasets, i.e., NSLKDD, UNSW-NB15, and IDS2017, show the superior performance of MIAE and MIAEFS compared to other methods, e.g., conventional classifiers, dimensionality reduction models, unsupervised representation learning methods with different input dimensions, and unsupervised feature selection models. Moreover, MIAE and MIAEFS combined with the Random Forest (RF) classifier achieve accuracy of 96.5% in detecting sophisticated attacks, e.g., Slowloris. The average running time for detecting an attack sample using RF with the representation of MIAE and MIAEFS is approximate 1.7E-6 seconds, whilst the model size is lower than 1 MB.
♻ ☆ LogPrécis: Unleashing Language Models for Automated Malicious Log Analysis
The collection of security-related logs holds the key to understanding attack behaviors and diagnosing vulnerabilities. Still, their analysis remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question arises whether and how LMs could be also useful for security experts since their logs contain intrinsically confused and obfuscated information. In this paper, we systematically study how to benefit from the state-of-the-art in LM to automatically analyze text-like Unix shell attack logs. We present a thorough design methodology that leads to LogPr\'ecis. It receives as input raw shell sessions and automatically identifies and assigns the attacker tactic to each portion of the session, i.e., unveiling the sequence of the attacker's goals. We demonstrate LogPr\'ecis capability to support the analysis of two large datasets containing about 400,000 unique Unix shell attacks. LogPr\'ecis reduces them into about 3,000 fingerprints, each grouping sessions with the same sequence of tactics. The abstraction it provides lets the analyst better understand attacks, identify fingerprints, detect novelty, link similar attacks, and track families and mutations. Overall, LogPr\'ecis, released as open source, paves the way for better and more responsive defense against cyberattacks.
comment: 18 pages, Computer&Security (https://www.sciencedirect.com/science/article/pii/S0167404824001068), code available at https://github.com/SmartData-Polito/logprecis, models available at https://huggingface.co/SmartDataPolito
♻ ☆ Inferentialist Resource Semantics
In systems modelling, a system typically comprises located resources relative to which processes execute. One important use of logic in informatics is in modelling such systems for the purpose of reasoning (perhaps automated) about their behaviour and properties. To this end, one requires an interpretation of logical formulae in terms of the resources and states of the system; such an interpretation is called a resource semantics of the logic. This paper shows how inferentialism -- the view that meaning is given in terms of inferential behaviour -- enables a versatile and expressive framework for resource semantics. Specifically, how inferentialism seamlessly incorporates the assertion-based approach of the logic of Bunched Implications, foundational in program verification (e.g., as the basis of Separation Logic), and the renowned number-of-uses reading of Linear Logic. This integration enables reasoning about shared and separated resources in intuitive and familiar ways, as well as about the composition and interfacing of system components.
♻ ☆ Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images ICLR 2024
Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, the deployment of VLMs necessitates substantial energy consumption and computational resources. Once attackers maliciously induce high energy consumption and latency time (energy-latency cost) during inference of VLMs, it will exhaust computational resources. In this paper, we explore this attack surface about availability of VLMs and aim to induce high energy-latency cost during inference of VLMs. We find that high energy-latency cost during inference of VLMs can be manipulated by maximizing the length of generated sequences. To this end, we propose verbose images, with the goal of crafting an imperceptible perturbation to induce VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is proposed to delay the occurrence of end-of-sequence (EOS) token, where EOS token is a signal for VLMs to stop generating further tokens. Moreover, an uncertainty loss and a token diversity loss are proposed to increase the uncertainty over each generated token and the diversity among all tokens of the whole generated sequence, respectively, which can break output dependency at token-level and sequence-level. Furthermore, a temporal weight adjustment algorithm is proposed, which can effectively balance these losses. Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87 times and 8.56 times compared to original images on MS-COCO and ImageNet datasets, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.
comment: Accepted by ICLR 2024
♻ ☆ SoK: Analysis techniques for WebAssembly
WebAssembly is a low-level bytecode language that allows high-level languages like C, C++, and Rust to be executed in the browser at near-native performance. In recent years, WebAssembly has gained widespread adoption is now natively supported by all modern browsers. However, vulnerabilities in memory-unsafe languages, like C and C++, can translate into vulnerabilities in WebAssembly binaries. Unfortunately, most WebAssembly binaries are compiled from such memory-unsafe languages, and these vulnerabilities have been shown to be practical in real-world scenarios. WebAssembly smart contracts have also been found to be vulnerable, causing significant financial loss. Additionally, WebAssembly has been used for malicious purposes like cryptojacking. To address these issues, several analysis techniques for WebAssembly binaries have been proposed. In this paper, we conduct a comprehensive literature review of these techniques and categorize them based on their analysis strategy and objectives. Furthermore, we compare and evaluate the techniques using quantitative data, highlighting their strengths and weaknesses. In addition, one of the main contributions of this paper is the identification of future research directions based on the thorough literature review conducted.
♻ ☆ Similarity-based Label Inference Attack against Training and Inference of Split Learning
Split learning is a promising paradigm for privacy-preserving distributed learning. The learning model can be cut into multiple portions to be collaboratively trained at the participants by exchanging only the intermediate results at the cut layer. Understanding the security performance of split learning is critical for many privacy-sensitive applications. This paper shows that the exchanged intermediate results, including the smashed data (i.e., extracted features from the raw data) and gradients during training and inference of split learning, can already reveal the private labels. We mathematically analyze the potential label leakages and propose the cosine and Euclidean similarity measurements for gradients and smashed data, respectively. Then, the two similarity measurements are shown to be unified in Euclidean space. Based on the similarity metric, we design three label inference attacks to efficiently recover the private labels during both the training and inference phases. Experimental results validate that the proposed approaches can achieve close to 100% accuracy of label attacks. The proposed attack can still achieve accurate predictions against various state-of-the-art defense mechanisms, including DP-SGD, label differential privacy, gradient compression, and Marvell.
♻ ☆ Private, Anonymous, Collateralizable Commitments vs. MEV
In this work, we introduce the private, anonymous, collateralizable commitments (PACCs) framework. PACCs allow any smart contract wallet holder to collateralize a claim, request, or commitment in general, in a private and anonymous manner. PACCs can prove arbitrarily much or little about the wallet generating the commitment, and/or the transaction which is being committed. We demonstrate that PACCs can be applied to effectively eliminate maximal-extractable value (MEV) in DeFi where it currently occurs, shifting MEV instead to censorship. After describing our protocol with detail, we provide an implementation using the Ethereum blockchain, and whose benchmarks prove how PACCs are completely feasible.
♻ ☆ Bridging BRC-20 to Ethereum
In this paper, we design, implement, and (partially-) evaluate a lightweight bridge (as a type of middleware) to connect the Bitcoin and Ethereum networks that were heterogeneously uncontactable before. Inspired by the recently introduced Bitcoin Request Comment (BRC-20) standard, we leverage the flexibility of Bitcoin inscriptions by embedding editable operations within each satoshi and mapping them to programmable Ethereum smart contracts. A user can initialize his/her requests from the Bitcoin network, subsequently triggering corresponding actions on the Ethereum network. We validate the lightweight nature of our solution and its ability to facilitate secure and seamless interactions between two heterogeneous ecosystems.
comment: Accepted by IEEE ICBC 2024
♻ ☆ Differentially Private Communication of Measurement Anomalies in the Smart Grid
In this paper, we present a framework based on differential privacy (DP) for querying electric power measurements to detect system anomalies or bad data. Our DP approach conceals consumption and system matrix data, while simultaneously enabling an untrusted third party to test hypotheses of anomalies, such as the presence of bad data, by releasing a randomized sufficient statistic for hypothesis-testing. We consider a measurement model corrupted by Gaussian noise and a sparse noise vector representing the attack, and we observe that the optimal test statistic is a chi-square random variable. To detect possible attacks, we propose a novel DP chi-square noise mechanism that ensures the test does not reveal private information about power injections or the system matrix. The proposed framework provides a robust solution for detecting bad data while preserving the privacy of sensitive power system data.
comment: 13 pages, 5 figures
♻ ☆ Defending Against Unforeseen Failure Modes with Latent Adversarial Training
AI systems sometimes exhibit harmful unintended behaviors post-deployment. This is often despite extensive diagnostics and debugging by developers. Minimizing risks from models is challenging because the attack surface is so large. It is not tractable to exhaustively search for inputs that may cause a model to fail. Red-teaming and adversarial training (AT) are commonly used to make AI systems more robust. However, they have not been sufficient to avoid many real-world failure modes that differ from the ones adversarially trained on. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without generating inputs that elicit them. LAT leverages the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. We use LAT to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.
♻ ☆ Differentially Private Conditional Independence Testing
Conditional independence (CI) tests are widely used in statistical data analysis, e.g., they are the building block of many algorithms for causal graph discovery. The goal of a CI test is to accept or reject the null hypothesis that $X \perp \!\!\! \perp Y \mid Z$, where $X \in \mathbb{R}, Y \in \mathbb{R}, Z \in \mathbb{R}^d$. In this work, we investigate conditional independence testing under the constraint of differential privacy. We design two private CI testing procedures: one based on the generalized covariance measure of Shah and Peters (2020) and another based on the conditional randomization test of Cand\`es et al. (2016) (under the model-X assumption). We provide theoretical guarantees on the performance of our tests and validate them empirically. These are the first private CI tests with rigorous theoretical guarantees that work for the general case when $Z$ is continuous.
♻ ☆ The Decisive Power of Indecision: Low-Variance Risk-Limiting Audits and Election Contestation via Marginal Mark Recording
Risk-limiting audits (RLAs) are techniques for verifying the outcomes of large elections. While they provide rigorous guarantees of correctness, widespread adoption has been impeded by both efficiency concerns and the fact they offer statistical, rather than absolute, conclusions. We attend to both of these difficulties, defining new families of audits that improve efficiency and offer qualitative advances in statistical power. Our new audits are enabled by revisiting the standard notion of a cast-vote record so that it can declare multiple possible mark interpretations rather than a single decision; this can reflect the presence of ambiguous marks, which appear regularly on hand-marked ballots. We show that this simple expedient can offer significant efficiency improvements with only minor changes to existing auditing infrastructure. We establish that these Bayesian comparison audits are indeed risk-limiting in the formal sense of (Fuller, Harrison, and Russell, 2022). We then define a new type of post-election audit we call a contested audit. These permit each candidate to provide a cast-vote record table advancing their own claim to victory. We prove that these audits offer remarkable sample efficiency, yielding control of risk with a constant number of samples (that is independent of margin). This is a first for an audit with provable soundness. These results are formulated in a game-based security model that specify quantitative soundness and completeness guarantees. Finally, we observe that these audits provide a direct means to handle contestation of election results affirmed by conventional RLAs.
comment: 26 pages, no prior publication
Computation and Language
☆ LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Large Multimodal Models (LMMs) have shown significant reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically use a fixed amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which increase the number of visual tokens significantly. However, due to the design of the Transformer architecture, computational costs associated with these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism and find, similar to prior work, that many visual tokens are spatially redundant. Based on this, we propose PruMerge, a novel adaptive visual token reduction approach, which largely reduces the number of visual tokens while maintaining comparable model performance. We first select the unpruned visual tokens based on their similarity to class tokens and spatial tokens. We then cluster the pruned tokens based on key similarity and merge the clustered tokens with the unpruned tokens to supplement their information. Empirically, when applied to LLaVA-1.5, our approach can compress the visual tokens by 14.4 times on average, and achieve comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are at https://llava-prumerge.github.io/.
comment: Project page: https://llava-prumerge.github.io/
☆ Can large language models explore in-context?
We investigate the extent to which contemporary Large Language Models (LLMs) can engage in exploration, a core capability in reinforcement learning and decision making. We focus on native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt. We experiment with GPT-3.5, GPT-4, and Llama2, using a variety of prompt designs, and find that the models do not robustly engage in exploration without substantial interventions: i) Across all of our experiments, only one configuration resulted in satisfactory exploratory behavior: GPT-4 with chain-of-thought reasoning and an externally summarized interaction history, presented as sufficient statistics; ii) All other configurations did not result in robust exploratory behavior, including those with chain-of-thought reasoning but unsummarized history. Although these findings can be interpreted positively, they suggest that external summarization -- which may not be possible in more complex settings -- is important for obtaining desirable behavior from LLM agents. We conclude that non-trivial algorithmic interventions, such as fine-tuning or dataset curation, may be required to empower LLM-based decision making agents in complex settings.
☆ A Transfer Attack to Image Watermarks
Watermark has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detector against evasion attacks in the white-box and black-box settings is well understood in the literature. However, the robustness in the no-box setting is much less understood. In particular, multiple studies claimed that image watermark is robust in such setting. In this work, we propose a new transfer evasion attack to image watermark in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show that, both theoretically and empirically, watermark-based AI-generated image detector is not robust to evasion attacks even if the attacker does not have access to the watermarking model nor the detection API.
☆ Towards Knowledge-Grounded Natural Language Understanding and Generation
This thesis investigates how natural language understanding and generation with transformer models can benefit from grounding the models with knowledge representations and addresses the following key research questions: (i) Can knowledge of entities extend its benefits beyond entity-centric tasks, such as entity linking? (ii) How can we faithfully and effectively extract such structured knowledge from raw text, especially noisy web text? (iii) How do other types of knowledge, beyond structured knowledge, contribute to improving NLP tasks? Studies in this thesis find that incorporating relevant and up-to-date knowledge of entities benefits fake news detection, and entity-focused code-switching significantly enhances zero-shot cross-lingual transfer on entity-centric tasks. In terms of effective and faithful approaches to extracting structured knowledge, it is observed that integrating negative examples and training with entity planning significantly improves performance. Additionally, it is established that other general forms of knowledge, such as parametric and distilled knowledge, enhance multimodal and multilingual knowledge-intensive tasks. This research shows the tangible benefits of diverse knowledge integration and motivates further exploration in this direction.
comment: PhD Thesis
☆ CoLLEGe: Concept Embedding Generation for Large Language Models
Current language models are unable to quickly learn new concepts on the fly, often requiring a more involved finetuning process to learn robustly. Prompting in-context is not robust to context distractions, and often fails to confer much information about the new concepts. Classic methods for few-shot word learning in NLP, relying on global word vectors, are less applicable to large language models. In this paper, we introduce a novel approach named CoLLEGe (Concept Learning with Language Embedding Generation) to modernize few-shot concept learning. CoLLEGe is a meta-learning framework capable of generating flexible embeddings for new concepts using a small number of example sentences or definitions. Our primary meta-learning objective is simply to facilitate a language model to make next word predictions in forthcoming sentences, making it compatible with language model pretraining. We design a series of tasks to test new concept learning in challenging real-world scenarios, including new word acquisition, definition inference, and verbal reasoning, and demonstrate that our method succeeds in each setting without task-specific training.
☆ Multi-Review Fusion-in-Context NAACL 2024
Grounded text generation, encompassing tasks such as long-form question-answering and summarization, necessitates both content selection and content consolidation. Current end-to-end methods are difficult to control and interpret due to their opaqueness. Accordingly, recent works have proposed a modular approach, with separate components for each step. Specifically, we focus on the second subtask, of generating coherent text given pre-selected content in a multi-document setting. Concretely, we formalize \textit{Fusion-in-Context} (FiC) as a standalone task, whose input consists of source texts with highlighted spans of targeted content. A model then needs to generate a coherent passage that includes all and only the target information. Our work includes the development of a curated dataset of 1000 instances in the reviews domain, alongside a novel evaluation framework for assessing the faithfulness and coverage of highlights, which strongly correlate to human judgment. Several baseline models exhibit promising outcomes and provide insightful analyses. This study lays the groundwork for further exploration of modular text generation in the multi-document setting, offering potential improvements in the quality and reliability of generated content. \footnote{Our benchmark, FuseReviews, including the dataset, evaluation framework and designated leaderboard, can be found at \url{https://fusereviews.github.io/}.}
comment: NAACL 2024, findings
☆ CO-Fun: A German Dataset on Company Outsourcing in Fund Prospectuses for Named Entity Recognition and Relation Extraction
The process of cyber mapping gives insights in relationships among financial entities and service providers. Centered around the outsourcing practices of companies within fund prospectuses in Germany, we introduce a dataset specifically designed for named entity recognition and relation extraction tasks. The labeling process on 948 sentences was carried out by three experts which yields to 5,969 annotations for four entity types (Outsourcing, Company, Location and Software) and 4,102 relation annotations (Outsourcing-Company, Company-Location). State-of-the-art deep learning models were trained to recognize entities and extract relations showing first promising results. An anonymized version of the dataset, along with guidelines and the code used for model training, are publicly available at https://www.dfki.uni-kl.de/cybermapping/data/CO-Fun-1.0-anonymized.zip.
☆ Controlled Training Data Generation with Diffusion Models
In this work, we present a method to control a text-to-image generative model to produce training data specifically "useful" for supervised learning. Unlike previous works that employ an open-loop approach and pre-define prompts to generate new data using either a language model or human expertise, we develop an automated closed-loop system which involves two feedback mechanisms. The first mechanism uses feedback from a given supervised model and finds adversarial prompts that result in image generations that maximize the model loss. While these adversarial prompts result in diverse data informed by the model, they are not informed of the target distribution, which can be inefficient. Therefore, we introduce the second feedback mechanism that guides the generation process towards a certain target distribution. We call the method combining these two mechanisms Guided Adversarial Prompts. We perform our evaluations on different tasks, datasets and architectures, with different types of distribution shifts (spuriously correlated data, unseen domains) and demonstrate the efficiency of the proposed feedback mechanisms compared to open-loop approaches.
comment: Project page at https://adversarial-prompts.epfl.ch/
☆ Human behaviour through a LENS: How Linguistic content triggers Emotions and Norms and determines Strategy choices
Over the last two decades, a growing body of experimental research has provided evidence that linguistic frames influence human behaviour in economic games, beyond the economic consequences of the available actions. This article proposes a novel framework that transcends the traditional confines of outcome-based preference models. According to the LENS model, the Linguistic description of the decision problem triggers Emotional responses and suggests potential Norms of behaviour, which then interact to shape an individual's Strategic choice. The article reviews experimental evidence that supports each path of the LENS model. Furthermore, it identifies and discusses several critical research questions that arise from this model, pointing towards avenues for future inquiry.
☆ Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions ACL 2024
This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.
comment: 10 pages, 4 figures, submitted to ACL 2024, for a screencast see https://www.youtube.com/watch?v=9GJExMelhdI
☆ Specifying Genericity through Inclusiveness and Abstractness Continuous Scales
This paper introduces a novel annotation framework for the fine-grained modeling of Noun Phrases' (NPs) genericity in natural language. The framework is designed to be simple and intuitive, making it accessible to non-expert annotators and suitable for crowd-sourced tasks. Drawing from theoretical and cognitive literature on genericity, this framework is grounded in established linguistic theory. Through a pilot study, we created a small but crucial annotated dataset of 324 sentences, serving as a foundation for future research. To validate our approach, we conducted an evaluation comparing our continuous annotations with existing binary annotations on the same dataset, demonstrating the framework's effectiveness in capturing nuanced aspects of genericity. Our work offers a practical resource for linguists, providing a first annotated dataset and an annotation scheme designed to build real-language datasets that can be used in studies on the semantics of genericity, and NLP practitioners, contributing to the development of commonsense knowledge repositories valuable in enhancing various NLP applications.
☆ Event Temporal Relation Extraction based on Retrieval-Augmented on LLMs IJCNN2024
Event temporal relation (TempRel) is a primary subject of the event relation extraction task. However, the inherent ambiguity of TempRel increases the difficulty of the task. With the rise of prompt engineering, it is important to design effective prompt templates and verbalizers to extract relevant knowledge. The traditional manually designed templates struggle to extract precise temporal knowledge. This paper introduces a novel retrieval-augmented TempRel extraction approach, leveraging knowledge retrieved from large language models (LLMs) to enhance prompt templates and verbalizers. Our method capitalizes on the diverse capabilities of various LLMs to generate a wide array of ideas for template and verbalizer design. Our proposed method fully exploits the potential of LLMs for generation tasks and contributes more knowledge to our design. Empirical evaluations across three widely recognized datasets demonstrate the efficacy of our method in improving the performance of event temporal relation extraction tasks.
comment: 8 pages,6 figures.Accepted to the International Joint Conference on Neural Networks (IJCNN2024)
☆ Imagination Augmented Generation: Learning to Imagine Richer Context for Question Answering over Large Language Models
Retrieval-Augmented-Generation and Gener-ation-Augmented-Generation have been proposed to enhance the knowledge required for question answering over Large Language Models (LLMs). However, the former depends on external resources, and both require incorporating the explicit documents into the context, which results in longer contexts that lead to more resource consumption. Recent works indicate that LLMs have modeled rich knowledge, albeit not effectively triggered or activated. Inspired by this, we propose a novel knowledge-augmented framework, Imagination-Augmented-Generation (IAG), which simulates the human capacity to compensate for knowledge deficits while answering questions solely through imagination, without relying on external resources. Guided by IAG, we propose an imagine richer context method for question answering (IMcQA), which obtains richer context through the following two modules: explicit imagination by generating a short dummy document with long context compress and implicit imagination with HyperNetwork for generating adapter weights. Experimental results on three datasets demonstrate that IMcQA exhibits significant advantages in both open-domain and closed-book settings, as well as in both in-distribution performance and out-of-distribution generalizations. Our code will be available at https://github.com/Xnhyacinth/IAG.
☆ Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach
Amidst the rapid evolution of LLMs, the significance of evaluation in comprehending and propelling these models forward is increasingly paramount. Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs. However, the extent and nature of these impacts continue to be subjects of debate because most assessments have been restricted to a limited number of models and data points. Clarifying the effects of these factors on performance scores can be more effectively achieved through a statistical lens. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. With the advent of a uniform evaluation framework, our research leverages an expansive dataset of evaluation results, introducing a comprehensive statistical methodology. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique, offering a robust and transparent approach to deciphering LLM performance data. Contrary to prevailing findings, our results challenge assumptions about emergent abilities and the influence of given training types and architectures in LLMs. These findings furnish new perspectives on the characteristics, intrinsic nature, and developmental trajectories of LLMs. By providing straightforward and reliable methods to scrutinize and reassess LLM performance data, this study contributes a nuanced perspective on LLM efficiency and potentials.
☆ FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
Modern Large Language Models (LLMs) are capable of following long and complex instructions that enable a diverse amount of user tasks. However, despite Information Retrieval (IR) models using LLMs as the backbone of their architectures, nearly all of them still only take queries as input, with no instructions. For the handful of recent models that do take instructions, it's unclear how they use them. We introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR builds off the long history of the TREC conferences: as TREC provides human annotators with instructions (also known as narratives) to determine document relevance, so should IR models be able to understand and decide relevance based on these detailed instructions. Our evaluation benchmark starts with three deeply judged TREC collections and alters the annotator instructions, re-annotating relevant documents. Through this process, we can measure how well IR models follow instructions, through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them for basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model has significant improvements (over 13%) after fine-tuning on our training set.
☆ Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
In this paper, we propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant to downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the less important MHAs to speed up inference. Besides, we also propose a novel propagation-of-information adapter (PIA) to serve the attention skipping of EAS and keep parameter efficiency, which can be further re-parameterized into feed-forward networks (FFNs) for zero-extra latency. To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic VL pre-trained model called METER, and conduct extensive experiments on a set of benchmarks. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference speed. For instance, LaVIN-EAS can obtain 89.98\% accuracy on ScineceQA while speeding up inference by 2.2 times to LaVIN
☆ InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection AAAI
Large Language Models (LLMs) raise concerns about lowering the cost of generating texts that could be used for unethical or illegal purposes, especially on social media. This paper investigates the promise of such models to help enforce legal requirements related to the disclosure of sponsored content online. We investigate the use of LLMs for generating synthetic Instagram captions with two objectives: The first objective (fidelity) is to produce realistic synthetic datasets. For this, we implement content-level and network-level metrics to assess whether synthetic captions are realistic. The second objective (utility) is to create synthetic data that is useful for sponsored content detection. For this, we evaluate the effectiveness of the generated synthetic data for training classifiers to identify undisclosed advertisements on Instagram. Our investigations show that the objectives of fidelity and utility may conflict and that prompt engineering is a useful but insufficient strategy. Additionally, we find that while individual synthetic posts may appear realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.
comment: To appear at the 18th International AAAI Conference on Web and Social Media (ICWSM 2024) -- please cite accordingly
☆ Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: a Haskell Case Study
Language model-based code completion models have quickly grown in use, helping thousands of developers write code in many different programming languages. However, research on code completion models typically focuses on imperative languages such as Python and JavaScript, which results in a lack of representation for functional programming languages. Consequently, these models often perform poorly on functional languages such as Haskell. To investigate whether this can be alleviated, we evaluate the performance of two language models for code, CodeGPT and UniXcoder, on the functional programming language Haskell. We fine-tune and evaluate the models on Haskell functions sourced from a publicly accessible Haskell dataset on HuggingFace. Additionally, we manually evaluate the models using our novel translated HumanEval dataset. Our automatic evaluation shows that knowledge of imperative programming languages in the pre-training of LLMs may not transfer well to functional languages, but that code completion on functional languages is feasible. Consequently, this shows the need for more high-quality Haskell datasets. A manual evaluation on HumanEval-Haskell indicates CodeGPT frequently generates empty predictions and extra comments, while UniXcoder more often produces incomplete or incorrect predictions. Finally, we release HumanEval-Haskell, along with the fine-tuned models and all code required to reproduce our experiments on GitHub (https://github.com/AISE-TUDelft/HaskellCCEval).
comment: To appear in the First Special Event on AI Foundation Models and Software Engineering (FORGE 2024)
☆ CACA Agent: Capability Collaboration based AI Agent
As AI Agents based on Large Language Models (LLMs) have shown potential in practical applications across various fields, how to quickly deploy an AI agent and how to conveniently expand the application scenario of AI agents has become a challenge. Previous studies mainly focused on implementing all the reasoning capabilities of AI agents within a single LLM, which often makes the model more complex and also reduces the extensibility of AI agent functionality. In this paper, we propose CACA Agent (Capability Collaboration based AI Agent), using an open architecture inspired by service computing. CACA Agent integrates a set of collaborative capabilities to implement AI Agents, not only reducing the dependence on a single LLM, but also enhancing the extensibility of both the planning abilities and the tools available to AI agents. Utilizing the proposed system, we present a demo to illustrate the operation and the application scenario extension of CACA Agent.
comment: 4 pages,5 figures
☆ Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Modern language models, while sophisticated, exhibit some inherent shortcomings, particularly in conversational settings. We claim that many of the observed shortcomings can be attributed to violation of one or more conversational principles. By drawing upon extensive research from both the social science and AI communities, we propose a set of maxims -- quantity, quality, relevance, manner, benevolence, and transparency -- for describing effective human-AI conversation. We first justify the applicability of the first four maxims (from Grice) in the context of human-AI interactions. We then argue that two new maxims, benevolence (concerning the generation of, and engagement with, harmful content) and transparency (concerning recognition of one's knowledge boundaries, operational constraints, and intents), are necessary for addressing behavior unique to modern human-AI interactions. The proposed maxims offer prescriptive guidance on how to assess conversational quality between humans and LLM-driven conversational agents, informing both their evaluation and improved design.
☆ Text clustering with LLM embeddings
Text clustering is an important approach for organising the growing amount of digital content, helping to structure and find hidden patterns in uncategorised data. In this research, we investigated how different textual embeddings - particularly those used in large language models (LLMs) - and clustering algorithms affect how text datasets are clustered. A series of experiments were conducted to assess how embeddings influence clustering results, the role played by dimensionality reduction through summarisation, and embedding size adjustment. Results reveal that LLM embeddings excel at capturing the nuances of structured language, while BERT leads the lightweight options in performance. In addition, we find that increasing embedding dimensionality and summarisation techniques do not uniformly improve clustering efficiency, suggesting that these strategies require careful analysis to use in real-life models. These results highlight a complex balance between the need for nuanced text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by incorporating embeddings from LLMs, thereby paving the way for improved methodologies and opening new avenues for future research in various types of textual analysis.
☆ Argument-Aware Approach To Event Linking
Event linking connects event mentions in text with relevant nodes in a knowledge base (KB). Prior research in event linking has mainly borrowed methods from entity linking, overlooking the distinct features of events. Compared to the extensively explored entity linking task, events have more complex structures and can be more effectively distinguished by examining their associated arguments. Moreover, the information-rich nature of events leads to the scarcity of event KBs. This emphasizes the need for event linking models to identify and classify event mentions not in the KB as ``out-of-KB,'' an area that has received limited attention. In this work, we tackle these challenges by introducing an argument-aware approach. First, we improve event linking models by augmenting input text with tagged event argument information, facilitating the recognition of key information about event mentions. Subsequently, to help the model handle ``out-of-KB'' scenarios, we synthesize out-of-KB training examples from in-KB instances through controlled manipulation of event arguments. Our experiment across two test datasets showed significant enhancements in both in-KB and out-of-KB scenarios, with a notable 22% improvement in out-of-KB evaluations.
comment: Work In Progress
☆ CHisIEC: An Information Extraction Corpus for Ancient Chinese History
Natural Language Processing (NLP) plays a pivotal role in the realm of Digital Humanities (DH) and serves as the cornerstone for advancing the structural analysis of historical and cultural heritage texts. This is particularly true for the domains of named entity recognition (NER) and relation extraction (RE). In our commitment to expediting ancient history and culture, we present the ``Chinese Historical Information Extraction Corpus''(CHisIEC). CHisIEC is a meticulously curated dataset designed to develop and evaluate NER and RE tasks, offering a resource to facilitate research in the field. Spanning a remarkable historical timeline encompassing data from 13 dynasties spanning over 1830 years, CHisIEC epitomizes the extensive temporal range and text heterogeneity inherent in Chinese historical documents. The dataset encompasses four distinct entity types and twelve relation types, resulting in a meticulously labeled dataset comprising 14,194 entities and 8,609 relations. To establish the robustness and versatility of our dataset, we have undertaken comprehensive experimentation involving models of various sizes and paradigms. Additionally, we have evaluated the capabilities of Large Language Models (LLMs) in the context of tasks related to ancient Chinese history. The dataset and code are available at \url{https://github.com/tangxuemei1995/CHisIEC}.
comment: 11 pages, 6 tables, 3 figures
☆ Construction of a Japanese Financial Benchmark for Large Language Models LREC
With the recent development of large language models (LLMs), models that focus on certain domains and languages have been discussed for their necessity. There is also a growing need for benchmarks to evaluate the performance of current LLMs in each domain. Therefore, in this study, we constructed a benchmark comprising multiple tasks specific to the Japanese and financial domains and performed benchmark measurements on some models. Consequently, we confirmed that GPT-4 is currently outstanding, and that the constructed benchmarks function effectively. According to our analysis, our benchmark can differentiate benchmark scores among models in all performance ranges by combining tasks with different difficulties.
comment: 9 pages, Joint Workshop of the 7th Financial Technology and Natural Language Processing (FinNLP), the 5th Knowledge Discovery from Unstructured Data in Financial Services (KDF), and The 4th Workshop on Economics and Natural Language Processing (ECONLP) In conjunction with LREC-COLING-2024
☆ LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by augmenting additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from incorrectly predicted data points by the LLM during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a LLaMA2-7B student model.
comment: Our code is available at https://github.com/SqueezeAILab/LLM2LLM
☆ ESG Classification by Implicit Rule Learning via GPT-4 LREC
Environmental, social, and governance (ESG) factors are widely adopted as higher investment return indicators. Accordingly, ongoing efforts are being made to automate ESG evaluation with language models to extract signals from massive web text easily. However, recent approaches suffer from a lack of training data, as rating agencies keep their evaluation metrics confidential. This paper investigates whether state-of-the-art language models like GPT-4 can be guided to align with unknown ESG evaluation criteria through strategies such as prompting, chain-of-thought reasoning, and dynamic in-context learning. We demonstrate the efficacy of these approaches by ranking 2nd in the Shared-Task ML-ESG-3 Impact Type track for Korean without updating the model on the provided training data. We also explore how adjusting prompts impacts the ability of language models to address financial tasks leveraging smaller models with openly available weights. We observe longer general pre-training to correlate with enhanced performance in financial downstream tasks. Our findings showcase the potential of language models to navigate complex, subjective evaluation guidelines despite lacking explicit training examples, revealing opportunities for training-free solutions for financial downstream tasks.
comment: Accepted as Shared Track Paper at 7th FinNLP Workshop @ LREC-COLING 2024
☆ MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness
This paper presents the MasonTigers entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches across 14 different languages. MasonTigers stands out as one of the two teams who participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21st in Track A, from 1st to 8th in Track B, and from 5th to 12th in Track C. Adhering to the task-specific constraints, our best performing approaches utilize ensemble of statistical machine learning approaches combined with language-specific BERT based models and sentence transformers.
☆ MasonTigers at SemEval-2024 Task 8: Performance Analysis of Transformer-based Models on Machine-Generated Text Detection
This paper presents the MasonTigers entry to the SemEval-2024 Task 8 - Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. The task encompasses Binary Human-Written vs. Machine-Generated Text Classification (Track A), Multi-Way Machine-Generated Text Classification (Track B), and Human-Machine Mixed Text Detection (Track C). Our best performing approaches utilize mainly the ensemble of discriminator transformer models along with sentence transformer and statistical machine learning approaches in specific cases. Moreover, zero-shot prompting and fine-tuning of FLAN-T5 are used for Track A and B.
☆ Risk and Response in Large Language Models: Evaluating Key Threat Categories
This paper explores the pressing issue of risk assessment in Large Language Models (LLMs) as they become increasingly prevalent in various applications. Focusing on how reward models, which are designed to fine-tune pretrained LLMs to align with human values, perceive and categorize different types of risks, we delve into the challenges posed by the subjective nature of preference-based training data. By utilizing the Anthropic Red-team dataset, we analyze major risk categories, including Information Hazards, Malicious Uses, and Discrimination/Hateful content. Our findings indicate that LLMs tend to consider Information Hazards less harmful, a finding confirmed by a specially developed regression model. Additionally, our analysis shows that LLMs respond less stringently to Information Hazards compared to other risks. The study further reveals a significant vulnerability of LLMs to jailbreaking attacks in Information Hazard scenarios, highlighting a critical security concern in LLM risk assessment and emphasizing the need for improved AI safety measures.
comment: 19 pages, 14 figures
☆ MasonTigers at SemEval-2024 Task 9: Solving Puzzles with an Ensemble of Chain-of-Thoughts
Our paper presents team MasonTigers submission to the SemEval-2024 Task 9 - which provides a dataset of puzzles for testing natural language understanding. We employ large language models (LLMs) to solve this task through several prompting techniques. Zero-shot and few-shot prompting generate reasonably good results when tested with proprietary LLMs, compared to the open-source models. We obtain further improved results with chain-of-thought prompting, an iterative prompting method that breaks down the reasoning process step-by-step. We obtain our best results by utilizing an ensemble of chain-of-thought prompts, placing 2nd in the word puzzle subtask and 13th in the sentence puzzle subtask. The strong performance of prompted LLMs demonstrates their capability for complex reasoning when provided with a decomposition of the thought process. Our work sheds light on how step-wise explanatory prompts can unlock more of the knowledge encoded in the parameters of large models.
☆ A Picture Is Worth a Graph: Blueprint Debate on Graph for Multimodal Reasoning
This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the issue, we propose a deductive (top-down) debating approach called Blueprint Debate on Graphs (BDoG). In BDoG, debates are confined to a blueprint graph to prevent opinion trivialization through world-level summarization. Moreover, by storing evidence in branches within the graph, BDoG mitigates distractions caused by frequent but irrelevant concepts. Extensive experiments validate BDoG, achieving state-of-the-art results in Science QA and MMBench with significant improvements over previous methods.
comment: Work in progress
☆ Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices
As deep learning models exponentially increase in size, optimizers such as Adam encounter significant memory consumption challenges due to the storage of first and second moment data. Current memory-efficient methods like Adafactor and CAME often compromise accuracy with their matrix factorization techniques. Addressing this, we introduce Adapprox, a novel approach that employs randomized low-rank matrix approximation for a more effective and accurate approximation of Adam's second moment. Adapprox features an adaptive rank selection mechanism, finely balancing accuracy and memory efficiency, and includes an optional cosine similarity guidance strategy to enhance stability and expedite convergence. In GPT-2 training and downstream tasks, Adapprox surpasses AdamW by achieving 34.5% to 49.9% and 33.8% to 49.9% memory savings for the 117M and 345M models, respectively, with the first moment enabled, and further increases these savings without the first moment. Besides, it enhances convergence speed and improves downstream task performance relative to its counterparts.
☆ Evidence-Driven Retrieval Augmented Response Generation for Online Misinformation NAACL 2024
The proliferation of online misinformation has posed significant threats to public interest. While numerous online users actively participate in the combat against misinformation, many of such responses can be characterized by the lack of politeness and supporting facts. As a solution, text generation approaches are proposed to automatically produce counter-misinformation responses. Nevertheless, existing methods are often trained end-to-end without leveraging external knowledge, resulting in subpar text quality and excessively repetitive responses. In this paper, we propose retrieval augmented response generation for online misinformation (RARG), which collects supporting evidence from scientific sources and generates counter-misinformation responses based on the evidences. In particular, our RARG consists of two stages: (1) evidence collection, where we design a retrieval pipeline to retrieve and rerank evidence documents using a database comprising over 1M academic articles; (2) response generation, in which we align large language models (LLMs) to generate evidence-based responses via reinforcement learning from human feedback (RLHF). We propose a reward function to maximize the utilization of the retrieved evidence while maintaining the quality of the generated text, which yields polite and factual responses that clearly refutes misinformation. To demonstrate the effectiveness of our method, we study the case of COVID-19 and perform extensive experiments with both in- and cross-domain datasets, where RARG consistently outperforms baselines by generating high-quality counter-misinformation responses.
comment: Accepted to NAACL 2024
☆ KnowLA: Enhancing Parameter-efficient Finetuning with Knowledgeable Adaptation NAACL 2024
Parameter-efficient finetuning (PEFT) is a key technique for adapting large language models (LLMs) to downstream tasks. In this paper, we study leveraging knowledge graph embeddings to improve the effectiveness of PEFT. We propose a knowledgeable adaptation method called KnowLA. It inserts an adaptation layer into an LLM to integrate the embeddings of entities appearing in the input text. The adaptation layer is trained in combination with LoRA on instruction data. Experiments on six benchmarks with two popular LLMs and three knowledge graphs demonstrate the effectiveness and robustness of KnowLA. We show that \modelname can help activate the relevant parameterized knowledge in an LLM to answer a question without changing its parameters or input prompts.
comment: Accepted in the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024)
☆ A Single Linear Layer Yields Task-Adapted Low-Rank Matrices LREC
Low-Rank Adaptation (LoRA) is a widely used Parameter-Efficient Fine-Tuning (PEFT) method that updates an initial weight matrix $W_0$ with a delta matrix $\Delta W$ consisted by two low-rank matrices $A$ and $B$. A previous study suggested that there is correlation between $W_0$ and $\Delta W$. In this study, we aim to delve deeper into relationships between $W_0$ and low-rank matrices $A$ and $B$ to further comprehend the behavior of LoRA. In particular, we analyze a conversion matrix that transform $W_0$ into low-rank matrices, which encapsulates information about the relationships. Our analysis reveals that the conversion matrices are similar across each layer. Inspired by these findings, we hypothesize that a single linear layer, which takes each layer's $W_0$ as input, can yield task-adapted low-rank matrices. To confirm this hypothesis, we devise a method named Conditionally Parameterized LoRA (CondLoRA) that updates initial weight matrices with low-rank matrices derived from a single linear layer. Our empirical results show that CondLoRA maintains a performance on par with LoRA, despite the fact that the trainable parameters of CondLoRA are fewer than those of LoRA. Therefore, we conclude that "a single linear layer yields task-adapted low-rank matrices."
comment: Accepted at LREC-COLING 2024
☆ On Zero-Shot Counterspeech Generation by LLMs LREC
With the emergence of numerous Large Language Models (LLM), the usage of such models in various Natural Language Processing (NLP) applications is increasing extensively. Counterspeech generation is one such key task where efforts are made to develop generative models by fine-tuning LLMs with hatespeech - counterspeech pairs, but none of these attempts explores the intrinsic properties of large language models in zero-shot settings. In this work, we present a comprehensive analysis of the performances of four LLMs namely GPT-2, DialoGPT, ChatGPT and FlanT5 in zero-shot settings for counterspeech generation, which is the first of its kind. For GPT-2 and DialoGPT, we further investigate the deviation in performance with respect to the sizes (small, medium, large) of the models. On the other hand, we propose three different prompting strategies for generating different types of counterspeech and analyse the impact of such strategies on the performance of the models. Our analysis shows that there is an improvement in generation quality for two datasets (17%), however the toxicity increase (25%) with increase in model size. Considering type of model, GPT-2 and FlanT5 models are significantly better in terms of counterspeech quality but also have high toxicity as compared to DialoGPT. ChatGPT are much better at generating counter speech than other models across all metrics. In terms of prompting, we find that our proposed strategies help in improving counter speech generation across all the models.
comment: 12 pages, 7 tables, accepted at LREC-COLING 2024
☆ Attention-Driven Reasoning: Unlocking the Potential of Large Language Models
Large Language Models (LLMs) have shown remarkable capabilities, but their reasoning abilities and underlying mechanisms remain poorly understood. We present a novel approach to enhance LLMs' reasoning through attention mechanism optimization, without additional training data. We identify inefficiencies in the attention distribution caused by non-semantic tokens and propose an algorithm to re-balance the skewed distribution, enabling the model to abstract more nuanced knowledge. Our experiments demonstrate significantly improved reasoning capabilities, particularly for non-STEM questions. We provide insights into the role of attention patterns in LLMs' reasoning and propose a method to enhance these abilities, paving the way for more powerful and versatile language models.
☆ Hierarchical Skip Decoding for Efficient Autoregressive Text Generation
Autoregressive decoding strategy is a commonly used method for text generation tasks with pre-trained language models, while early-exiting is an effective approach to speedup the inference stage. In this work, we propose a novel decoding strategy named Hierarchical Skip Decoding (HSD) for efficient autoregressive text generation. Different from existing methods that require additional trainable components, HSD is a plug-and-play method applicable to autoregressive text generation models, it adaptively skips decoding layers in a hierarchical manner based on the current sequence length, thereby reducing computational workload and allocating computation resources. Comprehensive experiments on five text generation datasets with pre-trained language models demonstrate HSD's advantages in balancing efficiency and text quality. With almost half of the layers skipped, HSD can sustain 90% of the text quality compared to vanilla autoregressive decoding, outperforming the competitive approaches.
☆ Stance Reasoner: Zero-Shot Stance Detection on Social Media with Explicit Reasoning COLING 2024
Social media platforms are rich sources of opinionated content. Stance detection allows the automatic extraction of users' opinions on various topics from such content. We focus on zero-shot stance detection, where the model's success relies on (a) having knowledge about the target topic; and (b) learning general reasoning strategies that can be employed for new topics. We present Stance Reasoner, an approach to zero-shot stance detection on social media that leverages explicit reasoning over background knowledge to guide the model's inference about the document's stance on a target. Specifically, our method uses a pre-trained language model as a source of world knowledge, with the chain-of-thought in-context learning approach to generate intermediate reasoning steps. Stance Reasoner outperforms the current state-of-the-art models on 3 Twitter datasets, including fully supervised models. It can better generalize across targets, while at the same time providing explicit and interpretable explanations for its predictions.
comment: Accepted to COLING 2024
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by final loss and language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
♻ ☆ MaCmS: Magahi Code-mixed Dataset for Sentiment Analysis
The present paper introduces new sentiment data, MaCMS, for Magahi-Hindi-English (MHE) code-mixed language, where Magahi is a less-resourced minority language. This dataset is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis tasks. Further, we also provide a linguistics analysis of the dataset to understand the structure of code-mixing and a statistical study to understand the language preferences of speakers with different polarities. With these analyses, we also train baseline models to evaluate the dataset's quality.
comment: Lrec-Colin 2024
♻ ☆ LLMR: Real-time Prompting of Interactive Worlds using Large Language Models CHI 2024
We present Large Language Model for Mixed Reality (LLMR), a framework for the real-time creation and modification of interactive Mixed Reality experiences using LLMs. LLMR leverages novel strategies to tackle difficult cases where ideal training data is scarce, or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity. Our framework relies on text interaction and the Unity game engine. By incorporating techniques for scene understanding, task planning, self-debugging, and memory management, LLMR outperforms the standard GPT-4 by 4x in average error rate. We demonstrate LLMR's cross-platform interoperability with several example worlds, and evaluate it on a variety of creation and modification tasks to show that it can produce and edit diverse objects, tools, and scenes. Finally, we conducted a usability study (N=11) with a diverse set that revealed participants had positive experiences with the system and would use it again.
comment: 46 pages, 18 figures; Matching version accepted at CHI 2024
♻ ☆ Building Efficient Universal Classifiers with Natural Language Inference
Generative Large Language Models (LLMs) have become the mainstream choice for fewshot and zeroshot learning thanks to the universality of text generation. Many users, however, do not need the broad capabilities of generative LLMs when they only want to automate a classification task. Smaller BERT-like models can also learn universal tasks, which allow them to do any text classification task without requiring fine-tuning (zeroshot classification) or to learn new tasks with only a few examples (fewshot), while being significantly more efficient than generative LLMs. This paper (1) explains how Natural Language Inference (NLI) can be used as a universal classification task that follows similar principles as instruction fine-tuning of generative LLMs, (2) provides a step-by-step guide with reusable Jupyter notebooks for building a universal classifier, and (3) shares the resulting universal classifier that is trained on 33 datasets with 389 diverse classes. Parts of the code we share has been used to train our older zeroshot classifiers that have been downloaded more than 55 million times via the Hugging Face Hub as of December 2023. Our new classifier improves zeroshot performance by 9.4%.
♻ ☆ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
♻ ☆ The optimal placement of the head in the noun phrase. The case of demonstrative, numeral, adjective and noun
The word order of a sentence is shaped by multiple principles. The principle of syntactic dependency distance minimization is in conflict with the principle of surprisal minimization (or predictability maximization) in single head syntactic dependency structures: while the former predicts that the head should be placed at the center of the linear arrangement, the latter predicts that the head should be placed at one of the ends (either first or last). A critical question is when surprisal minimization (or predictability maximization) should surpass syntactic dependency distance minimization. In the context of single head structures, it has been predicted that this is more likely to happen when two conditions are met, i.e. (a) fewer words are involved and (b) words are shorter. Here we test the prediction on the noun phrase when it is composed of a demonstrative, a numeral, an adjective and a noun. We find that, across preferred orders in languages, the noun tends to be placed at one of the ends, confirming the theoretical prediction. We also show evidence of anti locality effects: syntactic dependency distances in preferred orders are longer than expected by chance.
comment: Typos corrected
♻ ☆ Recurrent Drafter for Fast Speculative Decoding in Large Language Models
In this paper, we introduce an improved approach of speculative decoding aimed at enhancing the efficiency of serving large language models. Our method capitalizes on the strengths of two established techniques: the classic two-model speculative decoding approach, and the more recent single-model approach, Medusa. Drawing inspiration from Medusa, our approach adopts a single-model strategy for speculative decoding. However, our method distinguishes itself by employing a single, lightweight draft head with a recurrent dependency design, akin in essence to the small, draft model uses in classic speculative decoding, but without the complexities of the full transformer architecture. And because of the recurrent dependency, we can use beam search to swiftly filter out undesired candidates with the draft head. The outcome is a method that combines the simplicity of single-model design and avoids the need to create a data-dependent tree attention structure only for inference in Medusa. We empirically demonstrate the effectiveness of the proposed method on several popular open source language models, along with a comprehensive analysis of the trade-offs involved in adopting this approach.
comment: 11 pages, 6 figures
♻ ☆ Large Language Model-informed ECG Dual Attention Network for Heart Failure Risk Prediction
Heart failure (HF) poses a significant public health challenge, with a rising global mortality rate. Early detection and prevention of HF could significantly reduce its impact. We introduce a novel methodology for predicting HF risk using 12-lead electrocardiograms (ECGs). We present a novel, lightweight dual-attention ECG network designed to capture complex ECG features essential for early HF risk prediction, despite the notable imbalance between low and high-risk groups. This network incorporates a cross-lead attention module and twelve lead-specific temporal attention modules, focusing on cross-lead interactions and each lead's local dynamics. To further alleviate model overfitting, we leverage a large language model (LLM) with a public ECG-Report dataset for pretraining on an ECG-report alignment task. The network is then fine-tuned for HF risk prediction using two specific cohorts from the UK Biobank study, focusing on patients with hypertension (UKB-HYP) and those who have had a myocardial infarction (UKB-MI).The results reveal that LLM-informed pre-training substantially enhances HF risk prediction in these cohorts. The dual-attention design not only improves interpretability but also predictive accuracy, outperforming existing competitive methods with C-index scores of 0.6349 for UKB-HYP and 0.5805 for UKB-MI. This demonstrates our method's potential in advancing HF risk assessment with clinical complex ECG data.
comment: Under journal revision
♻ ☆ Cross-Lingual Learning vs. Low-Resource Fine-Tuning: A Case Study with Fact-Checking in Turkish LREC
The rapid spread of misinformation through social media platforms has raised concerns regarding its impact on public opinion. While misinformation is prevalent in other languages, the majority of research in this field has concentrated on the English language. Hence, there is a scarcity of datasets for other languages, including Turkish. To address this concern, we have introduced the FCTR dataset, consisting of 3238 real-world claims. This dataset spans multiple domains and incorporates evidence collected from three Turkish fact-checking organizations. Additionally, we aim to assess the effectiveness of cross-lingual transfer learning for low-resource languages, with a particular focus on Turkish. We demonstrate in-context learning (zero-shot and few-shot) performance of large language models in this context. The experimental results indicate that the dataset has the potential to advance research in the Turkish language.
comment: LREC-COLING 2024
♻ ☆ Robustness of the Random Language Model
The Random Language Model (De Giuli 2019) is an ensemble of stochastic context-free grammars, quantifying the syntax of human and computer languages. The model suggests a simple picture of first language learning as a type of annealing in the vast space of potential languages. In its simplest formulation, it implies a single continuous transition to grammatical syntax, at which the symmetry among potential words and categories is spontaneously broken. Here this picture is scrutinized by considering its robustness against extensions of the original model, and trajectories through parameter space different from those originally considered. It is shown here that (i) the scenario is robust to explicit symmetry breaking, an inevitable component of learning in the real world; and (ii) the transition to grammatical syntax can be encountered by fixing the deep (hidden) structure while varying the surface (observable) properties. It is also argued that the transition becomes a sharp thermodynamic transition in an idealized limit. Moreover, comparison with human data on the clustering coefficient of syntax networks suggests that the observed transition is equivalent to that normally experienced by children at age 24 months. The results are discussed in light of theory of first-language acquisition in linguistics, and recent successes in machine learning.
comment: 11 pages; v2: expanded discussion throughout
♻ ☆ VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding
The evolution of text to visual components facilitates people's daily lives, such as generating image, videos from text and identifying the desired elements within the images. Computer vision models involving the multimodal abilities in the previous days are focused on image detection, classification based on well-defined objects. Large language models (LLMs) introduces the transformation from nature language to visual objects, which present the visual layout for text contexts. OpenAI GPT-4 has emerged as the pinnacle in LLMs, while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models and algorithms to convert 2D images to their 3D representations. However, the mismatching between the algorithms with the problem could lead to undesired results. In response to this challenge, we propose an unified VisionGPT-3D framework to consolidate the state-of-the-art vision models, thereby facilitating the development of vision-oriented AI. VisionGPT-3D provides a versatile multimodal framework building upon the strengths of multimodal foundation models. It seamlessly integrates various SOTA vision models and brings the automation in the selection of SOTA vision models, identifies the suitable 3D mesh creation algorithms corresponding to 2D depth maps analysis, generates optimal results based on diverse multimodal inputs such as text prompts. Keywords: VisionGPT-3D, 3D vision understanding, Multimodal agent
comment: 12 pages, 7 figures, pending conference
♻ ☆ Llama meets EU: Investigating the European Political Spectrum through the Lens of LLMs NAACL 2024
Instruction-finetuned Large Language Models inherit clear political leanings that have been shown to influence downstream task performance. We expand this line of research beyond the two-party system in the US and audit Llama Chat in the context of EU politics in various settings to analyze the model's political knowledge and its ability to reason in context. We adapt, i.e., further fine-tune, Llama Chat on speeches of individual euro-parties from debates in the European Parliament to reevaluate its political leaning based on the EUandI questionnaire. Llama Chat shows considerable knowledge of national parties' positions and is capable of reasoning in context. The adapted, party-specific, models are substantially re-aligned towards respective positions which we see as a starting point for using chat-based LLMs as data-driven conversational engines to assist research in political science.
comment: accepted to NAACL 2024 as a short paper
♻ ☆ FunQA: Towards Surprising Video Comprehension
Surprising videos, such as funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) commonsense violations depicted in these videos. We introduce FunQA, a challenging video question-answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks which focus on less surprising contexts, e.g., cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess the model's capability in counter-intuitive timestamp localization, detailed video description, and reasoning around counter-intuitiveness. We also pose higher-level tasks, such as attributing a fitting and vivid title to the video and scoring the video creativity. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning a total of 24 video hours. Moreover, we propose FunMentor, an agent designed for Vision-Language Models (VLMs) that uses multi-turn dialogues to enhance models' understanding of counter-intuitiveness. Extensive experiments with existing VLMs demonstrate the effectiveness of FunMentor and reveal significant performance gaps for the FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.
comment: Project Page: https://funqa-benchmark.github.io/ Codebase: https://github.com/Jingkang50/FunQA
♻ ☆ Improving the Robustness of Large Language Models via Consistency Alignment LREC
Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal, as they may generate significantly inconsistent responses due to minor changes in the verbalized instructions. Recent literature has explored this inconsistency issue, highlighting the importance of continued improvement in the robustness of response generation. However, systematic analysis and solutions are still lacking. In this paper, we quantitatively define the inconsistency problem and propose a two-stage training framework consisting of instruction-augmented supervised fine-tuning and consistency alignment training. The first stage helps a model generalize on following instructions via similar instruction augmentations. In the second stage, we improve the diversity and help the model understand which responses are more aligned with human expectations by differentiating subtle differences in similar responses. The training process is accomplished by self-rewards inferred from the trained model at the first stage without referring to external human preference resources. We conduct extensive experiments on recent publicly available LLMs on instruction-following tasks and demonstrate the effectiveness of our training framework.
comment: Accepted by LREC-COLING 2024
♻ ☆ PhoGPT: Generative Pre-training for Vietnamese
We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with an 8192 context length, employing a vocabulary of 20480 token types. The chat variant, PhoGPT-4B-Chat, is the modeling output obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. In addition, we also demonstrate its superior performance compared to previous open-source models. Our PhoGPT models are available at: https://github.com/VinAIResearch/PhoGPT
comment: PhoGPT-4B Technical Report - 5 pages
♻ ☆ Self-Guard: Empower the LLM to Safeguard Itself
The jailbreak attack can bypass the safety measures of a Large Language Model (LLM), generating harmful content. This misuse of LLM has led to negative societal consequences. Currently, there are two main approaches to address jailbreak attacks: safety training and safeguards. Safety training focuses on further training LLM to enhance its safety. On the other hand, safeguards involve implementing external models or filters to prevent harmful outputs. However, safety training has constraints in its ability to adapt to new attack types and often leads to a drop in model performance. Safeguards have proven to be of limited help. To tackle these issues, we propose a novel approach called Self-Guard, which combines the strengths of both safety methods. Self-Guard includes two stages. In the first stage, we enhance the model's ability to assess harmful content, and in the second stage, we instruct the model to consistently perform harmful content detection on its own responses. The experiment has demonstrated that Self-Guard is robust against jailbreak attacks. In the bad case analysis, we find that LLM occasionally provides harmless responses to harmful queries. Additionally, we evaluated the general capabilities of the LLM before and after safety training, providing evidence that Self-Guard does not result in the LLM's performance degradation. In sensitivity tests, Self-Guard not only avoids inducing over-sensitivity in LLM but also can even mitigate this issue.
♻ ☆ E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity
Traditional pruning methods are known to be challenging to work in Large Language Models (LLMs) for Generative AI because of their unaffordable training process and large computational demands. For the first time, we introduce the information entropy of hidden state features into a pruning metric design, namely E-Sparse, to improve the accuracy of N:M sparsity on LLM. E-Sparse employs the information richness to leverage the channel importance, and further incorporates several novel techniques to put it into effect: (1) it introduces information entropy to enhance the significance of parameter weights and input feature norms as a novel pruning metric, and performs N:M sparsity without modifying the remaining weights. (2) it designs global naive shuffle and local block shuffle to quickly optimize the information distribution and adequately cope with the impact of N:M sparsity on LLMs' accuracy. E-Sparse is implemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere GPUs. Extensive experiments on the LLaMA family and OPT models show that E-Sparse can significantly speed up the model inference over the dense model (up to 1.53X) and obtain significant memory saving (up to 43.52%), with acceptable accuracy loss.
♻ ☆ Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation LREC
The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation. Knowledge Distillation (KD) enhances efficiency by transferring knowledge from a teacher model to a more compact student model. However, KD approaches to Transformer architecture often rely on heuristics, particularly when deciding which teacher layers to distill from. In this paper, we introduce the 'Align-to-Distill' (A2D) strategy, designed to address the feature mapping problem by adaptively aligning student attention heads with their teacher counterparts during training. The Attention Alignment Module in A2D performs a dense head-by-head comparison between student and teacher attention heads across layers, turning the combinatorial mapping heuristics into a learning problem. Our experiments show the efficacy of A2D, demonstrating gains of up to +3.61 and +0.63 BLEU points for WMT-2022 De->Dsb and WMT-2014 En->De, respectively, compared to Transformer baselines.
comment: Accepted to LREC-COLING 2024
♻ ☆ ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Self-attention is an essential component of large language models(LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLMs serving scenarios, the compute and memory operation cost of self-attention can be optimized by using the probability that multiple LLM requests have shared system prompts in prefixes. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into the auxiliary prefix tree. Consequently, on top of the prefix-tree based KV cache, we design an efficient self-attention kernel, where a two-phase partition algorithm is implemented to improve the data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8$\times$ compared to the start-of-the-art implementation, with the length of the system prompt ranging from 1024 to 4096.
♻ ☆ Zero-Shot Cross-Lingual Document-Level Event Causality Identification with Heterogeneous Graph Contrastive Transfer Learning LREC
Event Causality Identification (ECI) refers to the detection of causal relations between events in texts. However, most existing studies focus on sentence-level ECI with high-resource languages, leaving more challenging document-level ECI (DECI) with low-resource languages under-explored. In this paper, we propose a Heterogeneous Graph Interaction Model with Multi-granularity Contrastive Transfer Learning (GIMC) for zero-shot cross-lingual document-level ECI. Specifically, we introduce a heterogeneous graph interaction network to model the long-distance dependencies between events that are scattered over a document. Then, to improve cross-lingual transferability of causal knowledge learned from the source language, we propose a multi-granularity contrastive transfer learning module to align the causal representations across languages. Extensive experiments show our framework outperforms the previous state-of-the-art model by 9.4% and 8.2% of average F1 score on monolingual and multilingual scenarios respectively. Notably, in the multilingual scenario, our zero-shot framework even exceeds GPT-3.5 with few-shot learning by 24.3% in overall performance.
comment: Accepted at LREC-COLING 2024
♻ ☆ HealMe: Harnessing Cognitive Reframing in Large Language Models for Psychotherapy
Large Language Models (LLMs) can play a vital role in psychotherapy by adeptly handling the crucial task of cognitive reframing and overcoming challenges such as shame, distrust, therapist skill variability, and resource scarcity. Previous LLMs in cognitive reframing mainly converted negative emotions to positive ones, but these approaches have limited efficacy, often not promoting clients' self-discovery of alternative perspectives. In this paper, we unveil the Helping and Empowering through Adaptive Language in Mental Enhancement (HealMe) model. This novel cognitive reframing therapy method effectively addresses deep-rooted negative thoughts and fosters rational, balanced perspectives. Diverging from traditional LLM methods, HealMe employs empathetic dialogue based on psychotherapeutic frameworks. It systematically guides clients through distinguishing circumstances from feelings, brainstorming alternative viewpoints, and developing empathetic, actionable suggestions. Moreover, we adopt the first comprehensive and expertly crafted psychological evaluation metrics, specifically designed to rigorously assess the performance of cognitive reframing, in both AI-simulated dialogues and real-world therapeutic conversations. Experimental results show that our model outperforms others in terms of empathy, guidance, and logical coherence, demonstrating its effectiveness and potential positive impact on psychotherapy.
comment: 17 pages, 4 figures
♻ ☆ mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
comment: Working in Process
♻ ☆ An LLM-Enhanced Adversarial Editing System for Lexical Simplification COLING 2024
Lexical Simplification (LS) aims to simplify text at the lexical level. Existing methods rely heavily on annotated data, making it challenging to apply in low-resource scenarios. In this paper, we propose a novel LS method without parallel corpora. This method employs an Adversarial Editing System with guidance from a confusion loss and an invariance loss to predict lexical edits in the original sentences. Meanwhile, we introduce an innovative LLM-enhanced loss to enable the distillation of knowledge from Large Language Models (LLMs) into a small-size LS system. From that, complex words within sentences are masked and a Difficulty-aware Filling module is crafted to replace masked positions with simpler words. At last, extensive experimental results and analyses on three benchmark LS datasets demonstrate the effectiveness of our proposed method.
comment: Accepted by COLING 2024 main conference
♻ ☆ KoCoSa: Korean Context-aware Sarcasm Detection Dataset
Sarcasm is a way of verbal irony where someone says the opposite of what they mean, often to ridicule a person, situation, or idea. It is often difficult to detect sarcasm in the dialogue since detecting sarcasm should reflect the context (i.e., dialogue history). In this paper, we introduce a new dataset for the Korean dialogue sarcasm detection task, KoCoSa (Korean Context-aware Sarcasm Detection Dataset), which consists of 12.8K daily Korean dialogues and the labels for this task on the last response. To build the dataset, we propose an efficient sarcasm detection dataset generation pipeline: 1) generating new sarcastic dialogues from source dialogues with large language models, 2) automatic and manual filtering of abnormal and toxic dialogues, and 3) human annotation for the sarcasm detection task. We also provide a simple but effective baseline for the Korean sarcasm detection task trained on our dataset. Experimental results on the dataset show that our baseline system outperforms strong baselines like large language models, such as GPT-3.5, in the Korean sarcasm detection task. We show that the sarcasm detection task relies deeply on the existence of sufficient context. We will release the dataset at https://github.com/Yu-billie/KoCoSa_sarcasm_detection.
♻ ☆ RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners LREC
Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Existing solutions, such as deploying task-specific verifiers or voting over multiple reasoning paths, either require extensive human annotations or fail in scenarios with inconsistent responses. To address these challenges, we introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources. RankPrompt breaks down the ranking problem into a series of comparisons among diverse responses, leveraging the inherent capabilities of LLMs to generate chains of comparison as contextual exemplars. Our experiments across 11 arithmetic and commonsense reasoning tasks show that RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4, with improvements of up to 13%. Moreover, RankPrompt excels in LLM-based automatic evaluations for open-ended tasks, aligning with human judgments 74% of the time in the AlpacaEval dataset. It also exhibits robustness to variations in response order and consistency. Collectively, our results validate RankPrompt as an effective method for eliciting high-quality feedback from language models.
comment: LREC-Coling 2024 Long Paper
♻ ☆ Language Modeling for Content-enriched Recommendation
Recommender systems are indispensable in the realm of online applications, and sequential recommendation has enjoyed considerable prevalence due to its capacity to encapsulate the dynamic shifts in user interests. However, previous sequential modeling methods still have limitations in capturing contextual information. The primary reason is the lack of understanding of domain-specific knowledge and item-related textual content by language models. Fortunately, the emergence of powerful language models has unlocked the potential to incorporate extensive world knowledge into recommendation algorithms, enabling them to go beyond simple item attributes and truly understand the world surrounding user preferences. To achieve this, we propose LANCER, which leverages the semantic understanding capabilities of pre-trained language models to generate personalized recommendations. Our approach bridges the gap between language models and recommender systems, resulting in more human-like recommendations. We demonstrate the effectiveness of our approach through a series of experiments conducted on multiple benchmark datasets, showing promising results and providing valuable insights into the influence of our model on sequential recommendation tasks. Furthermore, our experimental codes are publicly available.
♻ ☆ Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models ICLR 2024
By design, large language models (LLMs) are static general-purpose models, expensive to retrain or update frequently. As they are increasingly adopted for knowledge-intensive tasks, it becomes evident that these design choices lead to failures to generate factual, relevant, and up-to-date knowledge. To this end, we propose Knowledge Card, a modular framework to plug in new factual and relevant knowledge into general-purpose LLMs. We first introduce knowledge cards -- specialized language models trained on corpora from specific domains and sources. Knowledge cards serve as parametric repositories that are selected at inference time to generate background knowledge for the base LLM. We then propose three content selectors to dynamically select and retain information in documents generated by knowledge cards, specifically controlling for relevance, brevity, and factuality of outputs. Finally, we propose two complementary integration approaches to augment the base LLM with the (relevant, factual) knowledge curated from the specialized LMs. Through extensive experiments, we demonstrate that Knowledge Card achieves state-of-the-art performance on six benchmark datasets. Ultimately, Knowledge Card framework enables dynamic synthesis and updates of knowledge from diverse domains. Its modularity will ensure that relevant knowledge can be continuously updated through the collective efforts of the research community.
comment: ICLR 2024, oral
♻ ☆ Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective training paradigm for training large multimodal models in non-English languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data can well generalize to other languages in a (quasi)-zero-shot manner, even surpassing models trained on image-text data in native languages. Taking Chinese as a practice of MPM, we build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source codes and model weights at https://github.com/OpenBMB/VisCPM.git.
comment: https://github.com/OpenBMB/VisCPM.git
♻ ☆ Few-shot Adaption to Distribution Shifts By Mixing Source and Target Embeddings
Pretrained machine learning models need to be adapted to distribution shifts when deployed in new target environments. When obtaining labeled data from the target distribution is expensive, few-shot adaptation with only a few examples from the target distribution becomes essential. In this work, we propose MixPro, a lightweight and highly data-efficient approach for few-shot adaptation. MixPro first generates a relatively large dataset by mixing (linearly combining) pre-trained embeddings of large source data with those of the few target examples. This process preserves important features of both source and target distributions, while mitigating the specific noise in the small target data. Then, it trains a linear classifier on the mixed embeddings to effectively adapts the model to the target distribution without overfitting the small target data. Theoretically, we demonstrate the advantages of MixPro over previous methods. Our experiments, conducted across various model architectures on 8 datasets featuring different types of distribution shifts, reveal that MixPro can outperform baselines by up to 7\%, with only 2-4 target examples.
♻ ☆ Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation NAACL 2024
Synthetic users are cost-effective proxies for real users in the evaluation of conversational recommender systems. Large language models show promise in simulating human-like behavior, raising the question of their ability to represent a diverse population of users. We introduce a new protocol to measure the degree to which language models can accurately emulate human behavior in conversational recommendation. This protocol is comprised of five tasks, each designed to evaluate a key property that a synthetic user should exhibit: choosing which items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Through evaluation of baseline simulators, we demonstrate these tasks effectively reveal deviations of language models from human behavior, and offer insights on how to reduce the deviations with model selection and prompting strategies.
comment: NAACL 2024
♻ ☆ RoleInteract: Evaluating the Social Interaction of Role-Playing Agents
Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce RoleInteract, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both individual and group levels of social interactions. The benchmark is constructed from a variety of sources and covers a wide range of 500 characters and over 6,000 question prompts and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that agents excelling in individual level does not imply their proficiency in group level. Moreover, the behavior of individuals may drift as a result of the influence exerted by other agents within the group. Experimental results on RoleInteract confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/RoleInteract.
♻ ☆ NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models
Temporal Logic (TL) can be used to rigorously specify complex high-level specification for systems in many engineering applications. The translation between natural language (NL) and TL has been under-explored due to the lack of dataset and generalizable model across different application domains. In this paper, we propose an accurate and generalizable transformation framework of English instructions from NL to TL, exploring the use of Large Language Models (LLMs) at multiple stages. Our contributions are twofold. First, we develop a framework to create a dataset of NL-TL pairs combining LLMs and human annotation. We publish a dataset with 28K NL-TL pairs. Then, we finetune T5 models on the lifted versions (i.e., the specific Atomic Propositions (AP) are hidden) of the NL and TL. The enhanced generalizability originates from two aspects: 1) Usage of lifted NL-TL characterizes common logical structures, without constraints of specific domains. 2) Application of LLMs in dataset creation largely enhances corpus richness. We test the generalization of trained models on five varied domains. To achieve full NL-TL transformation, we either combine the lifted model with AP recognition task or do the further finetuning on each specific domain. During the further finetuning, our model achieves higher accuracy (>95%) using only <10% training data, compared with the baseline sequence to sequence (Seq2Seq) model.
comment: 25 pages, 18 figures
♻ ☆ AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers
For effective human-robot interaction, robots need to understand, plan, and execute complex, long-horizon tasks described by natural language. Recent advances in large language models (LLMs) have shown promise for translating natural language into robot action sequences for complex tasks. However, existing approaches either translate the natural language directly into robot trajectories or factor the inference process by decomposing language into task sub-goals and relying on a motion planner to execute each sub-goal. When complex environmental and temporal constraints are involved, inference over planning tasks must be performed jointly with motion plans using traditional task-and-motion planning (TAMP) algorithms, making factorization into subgoals untenable. Rather than using LLMs to directly plan task sub-goals, we instead perform few-shot translation from natural language task descriptions to an intermediate task representation that can then be consumed by a TAMP algorithm to jointly solve the task and motion plan. To improve translation, we automatically detect and correct both syntactic and semantic errors via autoregressive re-prompting, resulting in significant improvements in task completion. We show that our approach outperforms several methods using LLMs as planners in complex task domains. See our project website https://yongchao98.github.io/MIT-REALM-AutoTAMP/ for prompts, videos, and code.
comment: 8 pages, 4 figures
♻ ☆ PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model NeurIPS 2023
Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during the steps of generation. This issue is often attributed to exposure bias - the difference between how a model is trained, and how it is used during inference. Denoising diffusion models provide an alternative approach in which a model can revisit and revise its output. However, they can be computationally expensive and prior efforts on text have led to models that produce less fluent output compared to autoregressive models, especially for longer text and paragraphs. In this paper, we propose PLANNER, a model that combines latent semantic diffusion with autoregressive generation, to generate fluent text while exercising global control over paragraphs. The model achieves this by combining an autoregressive "decoding" module with a "planning" module that uses latent diffusion to generate semantic paragraph embeddings in a coarse-to-fine manner. The proposed method is evaluated on various conditional generation tasks, and results on semantic generation, text completion and summarization show its effectiveness in generating high-quality long-form text in an efficient manner.
comment: Accepted by NeurIPS 2023, code at https://github.com/apple/ml-planner
♻ ☆ Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles NAACL 2024
Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, the summarization of diverse information dispersed across multiple articles about an event remains underexplored. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. Next, to enable consistent automatic evaluation, we conducted a comprehensive analysis to pinpoint the position and verbosity biases when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of summaries. Through correlation analyses, we outline the best practices for effectively using automatic LLM-based metrics on the DiverseSumm dataset. Finally, we study how LLMs summarize multiple news articles by analyzing which type of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them mainly due to their limited coverage, with GPT-4 only able to cover under 40% of the diverse information on average.
comment: NAACL 2024
Software Engineering
☆ Enhancing Testing at Meta with Rich-State Simulated Populations ICSE 2024
This paper reports the results of the deployment of Rich-State Simulated Populations at Meta for both automated and manual testing. We use simulated users (aka test users) to mimic user interactions and acquire state in much the same way that real user accounts acquire state. For automated testing, we present empirical results from deployment on the Facebook, Messenger, and Instagram apps for iOS and Android Platforms. These apps consist of tens of millions of lines of code, communicating with hundreds of millions of lines of backend code, and are used by over 2 billion people every day. Our results reveal that rich state increases average code coverage by 38\%, and endpoint coverage by 61\%. More importantly, it also yields an average increase of 115\% in the faults found by automated testing. The rich-state test user populations are also deployed in a (continually evolving) Test Universe; a web-enabled simulation platform for privacy-safe manual testing, which has been used by over 21,000 Meta engineers since its deployment in November 2022.
comment: ICSE 2024
☆ ACCESS: Assurance Case Centric Engineering of Safety-critical Systems
Assurance cases are used to communicate and assess confidence in critical system properties such as safety and security. Historically, assurance cases have been manually created documents, which are evaluated by system stakeholders through lengthy and complicated processes. In recent years, model-based system assurance approaches have gained popularity to improve the efficiency and quality of system assurance activities. This becomes increasingly important, as systems becomes more complex, it is a challenge to manage their development life-cycles, including coordination of development, verification and validation activities, and change impact analysis in inter-connected system assurance artifacts. Moreover, there is a need for assurance cases that support evolution during the operational life of the system, to enable continuous assurance in the face of an uncertain environment, as Robotics and Autonomous Systems (RAS) are adopted into society. In this paper, we contribute ACCESS - Assurance Case Centric Engineering of Safety-critical Systems, an engineering methodology, together with its tool support, for the development of safety critical systems around evolving model-based assurance cases. We show how model-based system assurance cases can trace to heterogeneous engineering artifacts (e.g. system architectural models, system safety analysis, system behaviour models, etc.), and how formal methods can be integrated during the development process. We demonstrate how assurance cases can be automatically evaluated both at development and runtime. We apply our approach to a case study based on an Autonomous Underwater Vehicle (AUV).
☆ An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets
Does the training of large language models potentially infringe upon code licenses? Furthermore, are there any datasets available that can be safely used for training these models without violating such licenses? In our study, we assess the current trends in the field and the importance of incorporating code into the training of large language models. Additionally, we examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future. To accomplish this, we compiled a list of 53 large language models trained on file-level code. We then extracted their datasets and analyzed how much they overlap with a dataset we created, consisting exclusively of strong copyleft code. Our analysis revealed that every dataset we examined contained license inconsistencies, despite being selected based on their associated repository licenses. We analyzed a total of 514 million code files, discovering 38 million exact duplicates present in our strong copyleft dataset. Additionally, we examined 171 million file-leading comments, identifying 16 million with strong copyleft licenses and another 11 million comments that discouraged copying without explicitly mentioning a license. Based on the findings of our study, which highlights the pervasive issue of license inconsistencies in large language models trained on code, our recommendation for both researchers and the community is to prioritize the development and adoption of best practices for dataset creation and management.
comment: Accepted to FORGE 2024
☆ Towards Deep Learning Enabled Cybersecurity Risk Assessment for Microservice Architectures
The widespread adoption of microservice architectures has given rise to a new set of software security challenges. These challenges stem from the unique features inherent in microservices. It is important to systematically assess and address software security challenges such as software security risk assessment. However, existing approaches prove inefficient in accurately evaluating the security risks associated with microservice architectures. To address this issue, we propose CyberWise Predictor, a framework designed for predicting and assessing security risks associated with microservice architectures. Our framework employs deep learning-based natural language processing models to analyze vulnerability descriptions for predicting vulnerability metrics to assess security risks. Our experimental evaluation shows the effectiveness of CyberWise Predictor, achieving an average accuracy of 92% in automatically predicting vulnerability metrics for new vulnerabilities. Our framework and findings serve as a guide for software developers to identify and mitigate security risks in microservice architectures.
☆ AllHands: Ask Me Anything on Large-scale Verbatim Feedback via Large Language Models
Verbatim feedback constitutes a valuable repository of user experiences, opinions, and requirements essential for software development. Effectively and efficiently extracting valuable insights from such data poses a challenging task. This paper introduces Allhands , an innovative analytic framework designed for large-scale feedback analysis through a natural language interface, leveraging large language models (LLMs). Allhands adheres to a conventional feedback analytic workflow, initially conducting classification and topic modeling on the feedback to convert them into a structurally augmented format, incorporating LLMs to enhance accuracy, robustness, generalization, and user-friendliness. Subsequently, an LLM agent is employed to interpret users' diverse questions in natural language on feedback, translating them into Python code for execution, and delivering comprehensive multi-modal responses, including text, code, tables, and images. We evaluate Allhands across three diverse feedback datasets. The experiments demonstrate that Allhands achieves superior efficacy at all stages of analysis, including classification and topic modeling, eventually providing users with an ``ask me anything'' experience with comprehensive, correct and human-readable response. To the best of our knowledge, Allhands stands as the first comprehensive feedback analysis framework that supports diverse and customized requirements for insight extraction through a natural language interface.
☆ On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions
Code completion is a key feature of Integrated Development Environments (IDEs), aimed at predicting the next tokens a developer is likely to write, helping them write code faster and with less effort. Modern code completion approaches are often powered by deep learning (DL) models. However, the swift evolution of programming languages poses a critical challenge to the performance of DL-based code completion models: Can these models generalize across different language versions? This paper delves into such a question. In particular, we assess the capabilities of a state-of-the-art model, CodeT5, to generalize across nine different Java versions, ranging from Java 2 to Java 17, while being exclusively trained on Java 8 code. Our evaluation spans three completion scenarios, namely, predicting tokens, constructs (e.g., the condition of an if statement) and entire code blocks. The results of our study reveal a noticeable disparity among language versions, with the worst performance being obtained in Java 2 and 17 - the most far apart versions compared to Java 8. We investigate possible causes for the performance degradation and show that the adoption of a limited version-specific fine-tuning can partially alleviate the problem. Our work raises awareness on the importance of continuous model refinement, and it can inform the design of alternatives to make code completion models more robust to language evolution.
☆ Testing for Fault Diversity in Reinforcement Learning ICSE 2024
Reinforcement Learning is the premier technique to approach sequential decision problems, including complex tasks such as driving cars and landing spacecraft. Among the software validation and verification practices, testing for functional fault detection is a convenient way to build trustworthiness in the learned decision model. While recent works seek to maximise the number of detected faults, none consider fault characterisation during the search for more diversity. We argue that policy testing should not find as many failures as possible (e.g., inputs that trigger similar car crashes) but rather aim at revealing as informative and diverse faults as possible in the model. In this paper, we explore the use of quality diversity optimisation to solve the problem of fault diversity in policy testing. Quality diversity (QD) optimisation is a type of evolutionary algorithm to solve hard combinatorial optimisation problems where high-quality diverse solutions are sought. We define and address the underlying challenges of adapting QD optimisation to the test of action policies. Furthermore, we compare classical QD optimisers to state-of-the-art frameworks dedicated to policy testing, both in terms of search efficiency and fault diversity. We show that QD optimisation, while being conceptually simple and generally applicable, finds effectively more diverse faults in the decision model, and conclude that QD-based policy testing is a promising approach.
comment: 11 pages, 4 figures, 1 algorithm, AST @ ICSE 2024
☆ Programmers Prefer Individually Assigned Tasks vs. Shared Responsibility
In traditional management, tasks are typically assigned to individuals, with each worker taking full responsibility for the success or failure of a task. In contrast, modern Agile, Lean, and eXtreme Programming practices advocate for shared responsibility, where an entire group is accountable for the outcome of a project or task. Despite numerous studies in other domains, the preferences of programmers have not been thoroughly analyzed. To address this gap, we conducted a survey featuring seven situational questions and collected the opinions of 120 software development practitioners. Our findings reveal that programmers prefer tasks to be assigned to them on an individual basis and appreciate taking personal responsibility for failures, as well as receiving individual rewards for successes. Understanding these preferences is crucial for project managers aiming to optimize team dynamics and ensure the successful completion of software projects.
☆ Comprehensive Evaluation and Insights into the Use of Large Language Models in the Automation of Behavior-Driven Development Acceptance Test Formulation
Behavior-driven development (BDD) is an Agile testing methodology fostering collaboration among developers, QA analysts, and stakeholders. In this manuscript, we propose a novel approach to enhance BDD practices using large language models (LLMs) to automate acceptance test generation. Our study uses zero and few-shot prompts to evaluate LLMs such as GPT-3.5, GPT-4, Llama-2-13B, and PaLM-2. The paper presents a detailed methodology that includes the dataset, prompt techniques, LLMs, and the evaluation process. The results demonstrate that GPT-3.5 and GPT-4 generate error-free BDD acceptance tests with better performance. The few-shot prompt technique highlights its ability to provide higher accuracy by incorporating examples for in-context learning. Furthermore, the study examines syntax errors, validation accuracy, and comparative analysis of LLMs, revealing their effectiveness in enhancing BDD practices. However, our study acknowledges that there are limitations to the proposed approach. We emphasize that this approach can support collaborative BDD processes and create opportunities for future research into automated BDD acceptance test generation using LLMs.
☆ "The Law Doesn't Work Like a Computer": Exploring Software Licensing Issues Faced by Legal Practitioners
Most modern software products incorporate open source components, which requires compliance with each component's licenses. As noncompliance can lead to significant repercussions, organizations often seek advice from legal practitioners to maintain license compliance, address licensing issues, and manage the risks of noncompliance. While legal practitioners play a critical role in the process, little is known in the software engineering community about their experiences within the open source license compliance ecosystem. To fill this knowledge gap, a joint team of software engineering and legal researchers designed and conducted a survey with 30 legal practitioners and related occupations and then held 16 follow-up interviews. We identified different aspects of OSS license compliance from the perspective of legal practitioners, resulting in 14 key findings in three main areas of interest: the general ecosystem of compliance, the specific compliance practices of legal practitioners, and the challenges that legal practitioners face. We discuss the implications of our findings.
comment: 24 pages, 2 figures, FSE 2024
☆ Just another copy and paste? Comparing the security vulnerabilities of ChatGPT generated code and StackOverflow answers SP
Sonatype's 2023 report found that 97% of developers and security leads integrate generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), into their development process. Concerns about the security implications of this trend have been raised. Developers are now weighing the benefits and risks of LLMs against other relied-upon information sources, such as StackOverflow (SO), requiring empirical data to inform their choice. In this work, our goal is to raise software developers awareness of the security implications when selecting code snippets by empirically comparing the vulnerabilities of ChatGPT and StackOverflow. To achieve this, we used an existing Java dataset from SO with security-related questions and answers. Then, we asked ChatGPT the same SO questions, gathering the generated code for comparison. After curating the dataset, we analyzed the number and types of Common Weakness Enumeration (CWE) vulnerabilities of 108 snippets from each platform using CodeQL. ChatGPT-generated code contained 248 vulnerabilities compared to the 302 vulnerabilities found in SO snippets, producing 20% fewer vulnerabilities with a statistically significant difference. Additionally, ChatGPT generated 19 types of CWE, fewer than the 22 found in SO. Our findings suggest developers are under-educated on insecure code propagation from both platforms, as we found 274 unique vulnerabilities and 25 types of CWE. Any code copied and pasted, created by AI or humans, cannot be trusted blindly, requiring good software engineering practices to reduce risk. Future work can help minimize insecure code propagation from any platform.
comment: 8 pages, 2 figures, accepted at Deep Learning Security and Privacy Workshop (DLSP) part of IEEE Symposium on Security and Privacy Workshops (SPW) for 2024
♻ ☆ LILAC: Log Parsing using LLMs with Adaptive Parsing Cache
Log parsing transforms log messages into structured formats, serving as the prerequisite step for various log analysis tasks. Although a variety of log parsing approaches have been proposed, their performance on complicated log data remains compromised due to the use of human-crafted rules or learning-based models with limited training data. The recent emergence of powerful large language models (LLMs) demonstrates their vast pre-trained knowledge related to code and logging, making it promising to apply LLMs for log parsing. However, their lack of specialized log parsing capabilities currently hinders their accuracy in parsing. Moreover, the inherent inconsistent answers, as well as the substantial overhead, prevent the practical adoption of LLM-based log parsing. To address these challenges, we propose LILAC, the first practical log parsing framework using LLMs with adaptive parsing cache. To facilitate accurate and robust log parsing, LILAC leverages the in-context learning (ICL) capability of the LLM by performing a hierarchical candidate sampling algorithm and selecting high-quality demonstrations. Furthermore, LILAC incorporates a novel component, an adaptive parsing cache, to store and refine the templates generated by the LLM. It helps mitigate LLM's inefficiency issue by enabling rapid retrieval of previously processed log templates. In this process, LILAC adaptively updates the templates within the parsing cache to ensure the consistency of parsed results. The extensive evaluation on public large-scale datasets shows that LILAC outperforms state-of-the-art methods by 69.5% in terms of the average F1 score of template accuracy. In addition, LILAC reduces the query times to LLMs by several orders of magnitude, achieving a comparable efficiency to the fastest baseline.
comment: This paper was accepted by The ACM International Conference on the Foundations of Software Engineering (FSE 2024)
♻ ☆ Empirically Exploring How Novices Write Software Models in Alloy
Writing declarative models has numerous benefits, ranging from automated reasoning and correction of design-level properties before systems are built, to automated testing and debugging of their implementations after they are built. Alloy is a declarative modeling language that is well-suited for verifying system designs. A key strength of Alloy is its scenario-finding toolset, the Analyzer, which allows users to explore all valid scenarios that adhere to the model's constraints up to a user-provided scope. However, even with visualized scenarios, it is difficult to write correct Alloy models. To address this, a growing body of work explores different techniques for debugging Alloy models. In order to develop and evaluate these techniques in an effective manor, this paper presents an empirical study of over 97,000 models written by novice users trying to learn Alloy. We investigate how users write both correct and incorrect models in order to produce a comprehensive benchmark for future use as well as a series of observations to guide debugging and educational efforts for Alloy model development.
♻ ☆ PatchZero: Zero-Shot Automatic Patch Correctness Assessment
Automated Program Repair (APR) techniques have shown more and more promising results in fixing real-world bugs. Despite the effectiveness, APR techniques still face an overfitting problem: a generated patch can be incorrect although it passes all tests. It is time-consuming to manually evaluate the correctness of generated patches that can pass all tests. To address this problem, many approaches have been proposed to automatically assess the correctness of patches generated by APR techniques. These approaches are mainly evaluated within the cross-validation setting. However, for patches generated by a new or unseen APR tool, users are implicitly required to manually label a significant portion of these patches in the cross-validation setting before inferring the remaining patches. To mitigate the issue, in this study, we propose \toolname, the patch correctness assessment by adopting a large language model for code. Specifically, for patches generated by a new or unseen APR tool, \toolname does not need labeled patches of this new or unseen APR tool for training but directly queries the large language model for code to get predictions on the correctness labels without training. In this way, \toolname can reduce the manual labeling effort when building a model to automatically assess the correctness of generated patches of new APR tools. \toolname prioritizes labeled patches from existing APR tools that exhibit semantic similarity to those generated by new APR tools, enhancing the accuracy achieved by \toolname for patches from new APR tools. Our experimental results showed that \toolname can achieve an accuracy of 84.4% and an F1-score of 86.5% on average although no labeled patch of the new or unseen APR tool is available. In addition, our proposed technique outperformed the prior state-of-the-art by a large margin.
comment: 18 pages
Programming Languages
☆ A hybrid approach to semi-automated Rust verification
While recent years have been witness to a large body of work on efficient and automated verification of safe Rust code, enabled by the rich guarantees of the Rust type system, much less progress has been made on reasoning about unsafe code due to its unique complexities. We propose a hybrid approach to end-to-end Rust verification in which powerful automated verification of safe Rust is combined with targeted semi-automated verification of unsafe~Rust. To this end, we present Gillian-Rust, a proof-of-concept semi-automated verification tool that is able to reason about type safety and functional correctness of unsafe~code. Built on top of the Gillian parametric compositional verification platform, Gillian-Rust automates a rich separation logic for real-world Rust, embedding the lifetime logic of RustBelt and the parametric propheciees of RustHornBelt. Using the unique extensibility of Gillian, our novel encoding of these features is fine-tuned to maximise automation and exposes a user-friendly API, allowing for low-effort verification of unsafe code. We link Gillian-Rust with Creusot, a state-of-the-art verifier for safe Rust, by providing a systematic encoding of unsafe code specifications that Creusot may use but not verify, demonstrating the feasibility of our hybrid~approach.
comment: 22 pages, 8 figures, preprint
☆ Fat API bindings of C++ objects into scripting languages
A fat API exposes nearly all of a C++ object's public attributes and methods to a consuming environment, such as a scripting language, or web client. This can be contrasted with a conventional, or thin API, where the API is defined up front, and the C++ object provides the implementation, most of which is private to the C++ layer. Obviously, reflection is required to expose C++ objects to a consuming layer like this -- this paper explores using the Classdesc system to implement reflection of C++ objects into a JavaScript/TypeScript environment via a RESTservice, and also via a Node.js API module.
☆ FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions
Reliable numerical computations are central to scientific computing, but the floating-point arithmetic that enables large-scale models is error-prone. Numeric exceptions are a common occurrence and can propagate through code, leading to flawed results. This paper presents FlowFPX, a toolkit for systematically debugging floating-point exceptions by recording their flow, coalescing exception contexts, and fuzzing in select locations. These tools help scientists discover when exceptions happen and track down their origin, smoothing the way to a reliable codebase.
comment: Presented at JuliaCon 2023; to appear in JuliaCon proceedings
♻ ☆ Free Doubly-Infinitary Distributive Categories are Cartesian Closed
We delve into the concept of categories with products that distribute over coproducts, which we call doubly-infinitary distributive categories. We show various instances of doubly-infinitary distributive categories aiming for a comparative analysis with established notions such as extensivity, infinitary distributiveness, and cartesian closedness. Our exploration reveals that this condition represents a substantial extension beyond the classical understanding of infinitary distributive categories. Our main theorem establishes that free doubly-infinitary distributive categories are cartesian closed. We end the paper with remarks on non-canonical isomorphisms, open questions, and future work.
comment: 15 pages, minor
♻ ☆ LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers
While polyhedral compilers have shown success in implementing advanced code transformations, they still have challenges in selecting the most profitable transformations that lead to the best speedups. This has motivated the use of machine learning to build cost models to guide the search for polyhedral optimizations. State-of-the-art polyhedral compilers have demonstrated a viable proof-of-concept of this approach. While such a proof-of-concept has shown promise, it still has significant limitations. State-of-the-art polyhedral compilers that use a deep-learning cost model only support a small subset of affine transformations, limiting their ability to apply complex code transformations. They also only support simple programs that have a single loop nest and a rectangular iteration domain, limiting their applicability to many programs. These limitations significantly impact the generality of such compilers and autoschedulers and put into question the whole approach. In this paper, we introduce LOOPer, the first polyhedral autoscheduler that uses a deep-learning based cost model and covers a large set of affine transformations and programs. It supports the exploration of a large set of affine transformations, allowing the application of complex sequences of polyhedral transformations. It also supports the optimization of programs with multiple loop nests and with rectangular and non-rectangular iteration domains, allowing the optimization of an extensive set of programs. We implement and evaluate LOOPer and show that it achieves speedups over the state-of-the-art. On the Polybench benchmark, LOOPer achieves a geometric mean speedup of 1.59x over Tiramisu. LOOPer also achieves competitive speedups with a geometric mean speedup of 1.34x over Pluto, a state-of-the-art polyhedral compiler that does not use a machine-learning based cost model.
Computer Vision and Pattern Recognition
☆ DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data CVPR 2024
Recently, there has been an increased interest in the practical problem of learning multiple dense scene understanding tasks from partially annotated data, where each training sample is only labeled for a subset of the tasks. The missing of task labels in training leads to low-quality and noisy predictions, as can be observed from state-of-the-art methods. To tackle this issue, we reformulate the partially-labeled multi-task dense prediction as a pixel-level denoising problem, and propose a novel multi-task denoising diffusion framework coined as DiffusionMTL. It designs a joint diffusion and denoising paradigm to model a potential noisy distribution in the task prediction or feature maps and generate rectified outputs for different tasks. To exploit multi-task consistency in denoising, we further introduce a Multi-Task Conditioning strategy, which can implicitly utilize the complementary nature of the tasks to help learn the unlabeled tasks, leading to an improvement in the denoising performance of the different tasks. Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps, and outperform the state-of-the-art methods on three challenging multi-task benchmarks, under two different partial-labeling evaluation settings. The code is available at https://prismformore.github.io/diffusionmtl/.
comment: The paper is accepted by CVPR 2024
☆ LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Large Multimodal Models (LMMs) have shown significant reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically use a fixed amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which increase the number of visual tokens significantly. However, due to the design of the Transformer architecture, computational costs associated with these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism and find, similar to prior work, that many visual tokens are spatially redundant. Based on this, we propose PruMerge, a novel adaptive visual token reduction approach, which largely reduces the number of visual tokens while maintaining comparable model performance. We first select the unpruned visual tokens based on their similarity to class tokens and spatial tokens. We then cluster the pruned tokens based on key similarity and merge the clustered tokens with the unpruned tokens to supplement their information. Empirically, when applied to LLaVA-1.5, our approach can compress the visual tokens by 14.4 times on average, and achieve comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are at https://llava-prumerge.github.io/.
comment: Project page: https://llava-prumerge.github.io/
☆ LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis
Recent text-to-3D generation approaches produce impressive 3D results but require time-consuming optimization that can take up to an hour per prompt. Amortized methods like ATT3D optimize multiple prompts simultaneously to improve efficiency, enabling fast text-to-3D synthesis. However, they cannot capture high-frequency geometry and texture details and struggle to scale to large prompt sets, so they generalize poorly. We introduce LATTE3D, addressing these limitations to achieve fast, high-quality generation on a significantly larger prompt set. Key to our method is 1) building a scalable architecture and 2) leveraging 3D data during optimization through 3D-aware diffusion priors, shape regularization, and model initialization to achieve robustness to diverse and complex training prompts. LATTE3D amortizes both neural field and textured surface generation to produce highly detailed textured meshes in a single forward pass. LATTE3D generates 3D objects in 400ms, and can be further enhanced with fast test-time optimization.
comment: See the project website at https://research.nvidia.com/labs/toronto-ai/LATTE3D/
☆ ThemeStation: Generating Theme-Aware 3D Assets from Few Exemplars
Real-world applications often require a large gallery of 3D assets that share a consistent theme. While remarkable advances have been made in general 3D content creation from text or image, synthesizing customized 3D assets following the shared theme of input 3D exemplars remains an open and challenging problem. In this work, we present ThemeStation, a novel approach for theme-aware 3D-to-3D generation. ThemeStation synthesizes customized 3D assets based on given few exemplars with two goals: 1) unity for generating 3D assets that thematically align with the given exemplars and 2) diversity for generating 3D assets with a high degree of variations. To this end, we design a two-stage framework that draws a concept image first, followed by a reference-informed 3D modeling stage. We propose a novel dual score distillation (DSD) loss to jointly leverage priors from both the input exemplars and the synthesized concept image. Extensive experiments and user studies confirm that ThemeStation surpasses prior works in producing diverse theme-aware 3D models with impressive quality. ThemeStation also enables various applications such as controllable 3D-to-3D generation.
comment: Project page: https://3dthemestation.github.io/
☆ DragAPart: Learning a Part-Level Motion Prior for Articulated Objects
We introduce DragAPart, a method that, given an image and a set of drags as input, can generate a new image of the same object in a new state, compatible with the action of the drags. Differently from prior works that focused on repositioning objects, DragAPart predicts part-level interactions, such as opening and closing a drawer. We study this problem as a proxy for learning a generalist motion model, not restricted to a specific kinematic structure or object category. To this end, we start from a pre-trained image generator and fine-tune it on a new synthetic dataset, Drag-a-Move, which we introduce. Combined with a new encoding for the drags and dataset randomization, the new model generalizes well to real images and different categories. Compared to prior motion-controlled generators, we demonstrate much better part-level motion understanding.
comment: Project page: https://dragapart.github.io/
☆ Long-CLIP: Unlocking the Long-Text Capability of CLIP
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The length of the text token is restricted to 77, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and aligns the CLIP latent space, making it readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities, including (1) a knowledge-preserved stretching of positional embedding and (2) a primary component matching of CLIP features. With leveraging just one million extra long text-image pairs, Long-CLIP has shown the superiority to CLIP for about 20% in long caption text-image retrieval and 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
comment: All codes and models are publicly available at https://github.com/beichenzbc/Long-CLIP
☆ InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies the different self- or weakly-supervised learning frameworks of masked video token reconstruction, cross-modal contrastive learning, and next token prediction. Different training stages would guide our model to capture different levels of structure and semantic information through different pretext tasks. At the data level, we prioritize the spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. We scale both data and model size for our InternVideo2. Through extensive experiments, we validate our designs and demonstrate the state-of-the-art performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason and comprehend long temporal contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo2/.
comment: a technical report about video understanding
☆ Augmented Reality based Simulated Data (ARSim) with multi-view consistency for AV perception networks
Detecting a diverse range of objects under various driving scenarios is essential for the effectiveness of autonomous driving systems. However, the real-world data collected often lacks the necessary diversity presenting a long-tail distribution. Although synthetic data has been utilized to overcome this issue by generating virtual scenes, it faces hurdles such as a significant domain gap and the substantial efforts required from 3D artists to create realistic environments. To overcome these challenges, we present ARSim, a fully automated, comprehensive, modular framework designed to enhance real multi-view image data with 3D synthetic objects of interest. The proposed method integrates domain adaptation and randomization strategies to address covariate shift between real and simulated data by inferring essential domain attributes from real data and employing simulation-based randomization for other attributes. We construct a simplified virtual scene using real data and strategically place 3D synthetic assets within it. Illumination is achieved by estimating light distribution from multiple images capturing the surroundings of the vehicle. Camera parameters from real data are employed to render synthetic assets in each frame. The resulting augmented multi-view consistent dataset is used to train a multi-camera perception network for autonomous vehicles. Experimental results on various AV perception tasks demonstrate the superior performance of networks trained on the augmented dataset.
comment: 17 pages, 15 figures, 7 tables
☆ Learning Topological Representations for Deep Image Understanding
In many scenarios, especially biomedical applications, the correct delineation of complex fine-scaled structures such as neurons, tissues, and vessels is critical for downstream analysis. Despite the strong predictive power of deep learning methods, they do not provide a satisfactory representation of these structures, thus creating significant barriers in scalable annotation and downstream analysis. In this dissertation, we tackle such challenges by proposing novel representations of these topological structures in a deep learning framework. We leverage the mathematical tools from topological data analysis, i.e., persistent homology and discrete Morse theory, to develop principled methods for better segmentation and uncertainty estimation, which will become powerful tools for scalable annotation.
comment: Ph.D. thesis from Stony Brook University. This thesis includes works arXiv:1906.05404, arXiv:2110.08335, arXiv:2112.07812, arXiv:2103.09992, arXiv:2206.01742
☆ SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series
Transformers have widely adopted attention networks for sequence mixing and MLPs for channel mixing, playing a pivotal role in achieving breakthroughs across domains. However, recent literature highlights issues with attention networks, including low inductive bias and quadratic complexity concerning input sequence length. State Space Models (SSMs) like S4 and others (Hippo, Global Convolutions, liquid S4, LRU, Mega, and Mamba), have emerged to address the above issues to help handle longer sequence lengths. Mamba, while being the state-of-the-art SSM, has a stability issue when scaled to large networks for computer vision datasets. We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations and uses the Mamba block for sequence modeling. Extensive performance studies across image and time-series benchmarks demonstrate that SiMBA outperforms existing SSMs, bridging the performance gap with state-of-the-art transformers. Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet and transfer learning benchmarks such as Stanford Car and Flower as well as task learning benchmarks as well as seven time series benchmark datasets. The project page is available on this website ~\url{https://github.com/badripatro/Simba}.
☆ Neural Plasticity-Inspired Foundation Model for Observing the Earth Crossing Modalities
The development of foundation models has revolutionized our ability to interpret the Earth's surface using satellite observational data. Traditional models have been siloed, tailored to specific sensors or data types like optical, radar, and hyperspectral, each with its own unique characteristics. This specialization hinders the potential for a holistic analysis that could benefit from the combined strengths of these diverse data sources. Our novel approach introduces the Dynamic One-For-All (DOFA) model, leveraging the concept of neural plasticity in brain science to integrate various data modalities into a single framework adaptively. This dynamic hypernetwork, adjusting to different wavelengths, enables a single versatile Transformer jointly trained on data from five sensors to excel across 12 distinct Earth observation tasks, including sensors never seen during pretraining. DOFA's innovative design offers a promising leap towards more accurate, efficient, and unified Earth observation analysis, showcasing remarkable adaptability and performance in harnessing the potential of multimodal Earth observation data.
comment: 33 pages, 10 figures
☆ Fully automated workflow for the design of patient-specific orthopaedic implants: application to total knee arthroplasty
Arthroplasty is commonly performed to treat joint osteoarthritis, reducing pain and improving mobility. While arthroplasty has known several technical improvements, a significant share of patients are still unsatisfied with their surgery. Personalised arthroplasty improves surgical outcomes however current solutions require delays, making it difficult to integrate in clinical routine. We propose a fully automated workflow to design patient-specific implants, presented for total knee arthroplasty, the most widely performed arthroplasty in the world nowadays. The proposed pipeline first uses artificial neural networks to segment the proximal and distal extremities of the femur and tibia. Then the full bones are reconstructed using augmented statistical shape models, combining shape and landmarks information. Finally, 77 morphological parameters are computed to design patient-specific implants. The developed workflow has been trained using 91 CT scans of lower limb and evaluated on 41 CT scans manually segmented, in terms of accuracy and execution time. The workflow accuracy was $0.4\pm0.2mm$ for the segmentation, $1.2\pm0.4mm$ for the full bones reconstruction, and $2.8\pm2.2mm$ for the anatomical landmarks determination. The custom implants fitted the patients' anatomy with $0.6\pm0.2mm$ accuracy. The whole process from segmentation to implants' design lasted about 5 minutes. The proposed workflow allows for a fast and reliable personalisation of knee implants, directly from the patient CT image without requiring any manual intervention. It establishes a patient-specific pre-operative planning for TKA in a very short time making it easily available for all patients. Combined with efficient implant manufacturing techniques, this solution could help answer the growing number of arthroplasties while reducing complications and improving the patients' satisfaction.
☆ Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization CVPR 2024
In text-to-image personalization, a timely and crucial challenge is the tendency of generated images overfitting to the biases present in the reference images. We initiate our study with a comprehensive categorization of the biases into background, nearby-object, tied-object, substance (in style re-contextualization), and pose biases. These biases manifest in the generated images due to their entanglement into the subject embedding. This undesired embedding entanglement not only results in the reflection of biases from the reference images into the generated images but also notably diminishes the alignment of the generated images with the given generation prompt. To address this challenge, we propose SID~(Selectively Informative Description), a text description strategy that deviates from the prevalent approach of only characterizing the subject's class identification. SID is generated utilizing multimodal GPT-4 and can be seamlessly integrated into optimization-based models. We present comprehensive experimental results along with analyses of cross-attention maps, subject-alignment, non-subject-disentanglement, and text-alignment.
comment: Published at CVPR 2024
☆ Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection AAAI2024
Training high-accuracy 3D detectors necessitates massive labeled 3D annotations with 7 degree-of-freedom, which is laborious and time-consuming. Therefore, the form of point annotations is proposed to offer significant prospects for practical applications in 3D detection, which is not only more accessible and less expensive but also provides strong spatial information for object localization.In this paper, we empirically discover that it is non-trivial to merely adapt Point-DETR to its 3D form, encountering two main bottlenecks: 1) it fails to encode strong 3D prior into the model, and 2) it generates low-quality pseudo labels in distant regions due to the extreme sparsity of LiDAR points. To overcome these challenges, we introduce Point-DETR3D, a teacher-student framework for weakly semi-supervised 3D detection, designed to fully capitalize on point-wise supervision within a constrained instance-wise annotation budget.Different from Point-DETR which encodes 3D positional information solely through a point encoder, we propose an explicit positional query initialization strategy to enhance the positional prior. Considering the low quality of pseudo labels at distant regions produced by the teacher model, we enhance the detector's perception by incorporating dense imagery data through a novel Cross-Modal Deformable RoI Fusion (D-RoI).Moreover, an innovative point-guided self-supervised learning technique is proposed to allow for fully exploiting point priors, even in student models.Extensive experiments on representative nuScenes dataset demonstrate our Point-DETR3D obtains significant improvements compared to previous works. Notably, with only 5% of labeled data, Point-DETR3D achieves over 90% performance of its fully supervised counterpart.
comment: Accepted by AAAI2024
☆ Ultrasound Imaging based on the Variance of a Diffusion Restoration Model
Despite today's prevalence of ultrasound imaging in medicine, ultrasound signal-to-noise ratio is still affected by several sources of noise and artefacts. Moreover, enhancing ultrasound image quality involves balancing concurrent factors like contrast, resolution, and speckle preservation. Recently, there has been progress in both model-based and learning-based approaches addressing the problem of ultrasound image reconstruction. Bringing the best from both worlds, we propose a hybrid reconstruction method combining an ultrasound linear direct model with a learning-based prior coming from a generative Denoising Diffusion model. More specifically, we rely on the unsupervised fine-tuning of a pre-trained Denoising Diffusion Restoration Model (DDRM). Given the nature of multiplicative noise inherent to ultrasound, this paper proposes an empirical model to characterize the stochasticity of diffusion reconstruction of ultrasound images, and shows the interest of its variance as an echogenicity map estimator. We conduct experiments on synthetic, in-vitro, and in-vivo data, demonstrating the efficacy of our variance imaging approach in achieving high-quality image reconstructions from single plane-wave acquisitions and in comparison to state-of-the-art methods.
comment: 5 pages; submitted to EUSIPCO 2024. arXiv admin note: text overlap with arXiv:2310.20618
☆ Global Control for Local SO(3)-Equivariant Scale-Invariant Vessel Segmentation
Personalized 3D vascular models can aid in a range of diagnostic, prognostic, and treatment-planning tasks relevant to cardiovascular disease management. Deep learning provides a means to automatically obtain such models. Ideally, a user should have control over the exact region of interest (ROI) to be included in a vascular model, and the model should be watertight and highly accurate. To this end, we propose a combination of a global controller leveraging voxel mask segmentations to provide boundary conditions for vessels of interest to a local, iterative vessel segmentation model. We introduce the preservation of scale- and rotational symmetries in the local segmentation model, leading to generalisation to vessels of unseen sizes and orientations. Combined with the global controller, this enables flexible 3D vascular model building, without additional retraining. We demonstrate the potential of our method on a dataset containing abdominal aortic aneurysms (AAAs). Our method performs on par with a state-of-the-art segmentation model in the segmentation of AAAs, iliac arteries and renal arteries, while providing a watertight, smooth surface segmentation. Moreover, we demonstrate that by adapting the global controller, we can easily extend vessel sections in the 3D model.
☆ CR3DT: Camera-RADAR Fusion for 3D Detection and Tracking
Accurate detection and tracking of surrounding objects is essential to enable self-driving vehicles. While Light Detection and Ranging (LiDAR) sensors have set the benchmark for high performance, the appeal of camera-only solutions lies in their cost-effectiveness. Notably, despite the prevalent use of Radio Detection and Ranging (RADAR) sensors in automotive systems, their potential in 3D detection and tracking has been largely disregarded due to data sparsity and measurement noise. As a recent development, the combination of RADARs and cameras is emerging as a promising solution. This paper presents Camera-RADAR 3D Detection and Tracking (CR3DT), a camera-RADAR fusion model for 3D object detection, and Multi-Object Tracking (MOT). Building upon the foundations of the State-of-the-Art (SotA) camera-only BEVDet architecture, CR3DT demonstrates substantial improvements in both detection and tracking capabilities, by incorporating the spatial and velocity information of the RADAR sensor. Experimental results demonstrate an absolute improvement in detection performance of 5.3% in mean Average Precision (mAP) and a 14.9% increase in Average Multi-Object Tracking Accuracy (AMOTA) on the nuScenes dataset when leveraging both modalities. CR3DT bridges the gap between high-performance and cost-effective perception systems in autonomous driving, by capitalizing on the ubiquitous presence of RADAR in automotive applications.
☆ Controlled Training Data Generation with Diffusion Models
In this work, we present a method to control a text-to-image generative model to produce training data specifically "useful" for supervised learning. Unlike previous works that employ an open-loop approach and pre-define prompts to generate new data using either a language model or human expertise, we develop an automated closed-loop system which involves two feedback mechanisms. The first mechanism uses feedback from a given supervised model and finds adversarial prompts that result in image generations that maximize the model loss. While these adversarial prompts result in diverse data informed by the model, they are not informed of the target distribution, which can be inefficient. Therefore, we introduce the second feedback mechanism that guides the generation process towards a certain target distribution. We call the method combining these two mechanisms Guided Adversarial Prompts. We perform our evaluations on different tasks, datasets and architectures, with different types of distribution shifts (spuriously correlated data, unseen domains) and demonstrate the efficiency of the proposed feedback mechanisms compared to open-loop approaches.
comment: Project page at https://adversarial-prompts.epfl.ch/
☆ WSCLoc: Weakly-Supervised Sparse-View Camera Relocalization
Despite the advancements in deep learning for camera relocalization tasks, obtaining ground truth pose labels required for the training process remains a costly endeavor. While current weakly supervised methods excel in lightweight label generation, their performance notably declines in scenarios with sparse views. In response to this challenge, we introduce WSCLoc, a system capable of being customized to various deep learning-based relocalization models to enhance their performance under weakly-supervised and sparse view conditions. This is realized with two stages. In the initial stage, WSCLoc employs a multilayer perceptron-based structure called WFT-NeRF to co-optimize image reconstruction quality and initial pose information. To ensure a stable learning process, we incorporate temporal information as input. Furthermore, instead of optimizing SE(3), we opt for $\mathfrak{sim}(3)$ optimization to explicitly enforce a scale constraint. In the second stage, we co-optimize the pre-trained WFT-NeRF and WFT-Pose. This optimization is enhanced by Time-Encoding based Random View Synthesis and supervised by inter-frame geometric constraints that consider pose, depth, and RGB information. We validate our approaches on two publicly available datasets, one outdoor and one indoor. Our experimental results demonstrate that our weakly-supervised relocalization solutions achieve superior pose estimation accuracy in sparse-view scenarios, comparable to state-of-the-art camera relocalization methods. We will make our code publicly available.
☆ Hyperbolic Metric Learning for Visual Outlier Detection
Out-Of-Distribution (OOD) detection is critical to deploy deep learning models in safety-critical applications. However, the inherent hierarchical concept structure of visual data, which is instrumental to OOD detection, is often poorly captured by conventional methods based on Euclidean geometry. This work proposes a metric framework that leverages the strengths of Hyperbolic geometry for OOD detection. Inspired by previous works that refine the decision boundary for OOD data with synthetic outliers, we extend this method to Hyperbolic space. Interestingly, we find that synthetic outliers do not benefit OOD detection in Hyperbolic space as they do in Euclidean space. Furthermore we explore the relationship between OOD detection performance and Hyperbolic embedding dimension, addressing practical concerns in resource-constrained environments. Extensive experiments show that our framework improves the FPR95 for OOD detection from 22\% to 15\% and from 49% to 28% on CIFAR-10 and CIFAR-100 respectively compared to Euclidean methods.
☆ Spectral Motion Alignment for Video Motion Transfer using Diffusion Models
The evolution of diffusion models has greatly impacted video generation and understanding. Particularly, text-to-video diffusion models (VDMs) have significantly facilitated the customization of input video with target appearance, motion, etc. Despite these advances, challenges persist in accurately distilling motion information from video frames. While existing works leverage the consecutive frame residual as the target motion vector, they inherently lack global motion context and are vulnerable to frame-wise distortions. To address this, we present Spectral Motion Alignment (SMA), a novel framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics, and mitigating spatial artifacts. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
comment: Project page: https://geonyeong-park.github.io/spectral-motion-alignment/
Self-Supervised Backbone Framework for Diverse Agricultural Vision Tasks
Computer vision in agriculture is game-changing with its ability to transform farming into a data-driven, precise, and sustainable industry. Deep learning has empowered agriculture vision to analyze vast, complex visual data, but heavily rely on the availability of large annotated datasets. This remains a bottleneck as manual labeling is error-prone, time-consuming, and expensive. The lack of efficient labeling approaches inspired us to consider self-supervised learning as a paradigm shift, learning meaningful feature representations from raw agricultural image data. In this work, we explore how self-supervised representation learning unlocks the potential applicability to diverse agriculture vision tasks by eliminating the need for large-scale annotated datasets. We propose a lightweight framework utilizing SimCLR, a contrastive learning approach, to pre-train a ResNet-50 backbone on a large, unannotated dataset of real-world agriculture field images. Our experimental analysis and results indicate that the model learns robust features applicable to a broad range of downstream agriculture tasks discussed in the paper. Additionally, the reduced reliance on annotated data makes our approach more cost-effective and accessible, paving the way for broader adoption of computer vision in agriculture.
☆ Reasoning-Enhanced Object-Centric Learning for Videos
Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of the effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computations and fusion. Our experiment results on various datasets show that STATM can significantly enhance object-centric learning capabilities of slot-based video models.
☆ IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection CVPR 2024
Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation typically exhibit small sizes, and the associated point cloud context is inherently sparse, which leads to great challenges for reliable 3D perception. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. IS-Fusion essentially differs from existing approaches that only focus on the BEV scene-level fusion by explicitly incorporating instance-level multimodal information, thus facilitating the instance-centric tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all the published multimodal works to date. Code is available at: https://github.com/yinjunbo/IS-Fusion.
comment: Accepted to CVPR 2024; Code: https://github.com/yinjunbo/IS-Fusion
☆ WEEP: A method for spatial interpretation of weakly supervised CNN models in computational pathology
Deep learning enables the modelling of high-resolution histopathology whole-slide images (WSI). Weakly supervised learning of tile-level data is typically applied for tasks where labels only exist on the patient or WSI level (e.g. patient outcomes or histological grading). In this context, there is a need for improved spatial interpretability of predictions from such models. We propose a novel method, Wsi rEgion sElection aPproach (WEEP), for model interpretation. It provides a principled yet straightforward way to establish the spatial area of WSI required for assigning a particular prediction label. We demonstrate WEEP on a binary classification task in the area of breast cancer computational pathology. WEEP is easy to implement, is directly connected to the model-based decision process, and offers information relevant to both research and diagnostic applications.
☆ Shadow Generation for Composite Image Using Diffusion model CVPR2024
In the realm of image composition, generating realistic shadow for the inserted foreground remains a formidable challenge. Previous works have developed image-to-image translation models which are trained on paired training data. However, they are struggling to generate shadows with accurate shapes and intensities, hindered by data scarcity and inherent task complexity. In this paper, we resort to foundation model with rich prior knowledge of natural shadow images. Specifically, we first adapt ControlNet to our task and then propose intensity modulation modules to improve the shadow intensity. Moreover, we extend the small-scale DESOBA dataset to DESOBAv2 using a novel data acquisition pipeline. Experimental results on both DESOBA and DESOBAv2 datasets as well as real composite images demonstrate the superior capability of our model for shadow generation task. The dataset, code, and model are released at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2.
comment: accepted by CVPR2024
☆ LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example
Recent advances in 3D face stylization have made significant strides in few to zero-shot settings. However, the degree of stylization achieved by existing methods is often not sufficient for practical applications because they are mostly based on statistical 3D Morphable Models (3DMM) with limited variations. To this end, we propose a method that can produce a highly stylized 3D face model with desired topology. Our methods train a surface deformation network with 3DMM and translate its domain to the target style using a paired exemplar. The network achieves stylization of the 3D face mesh by mimicking the style of the target using a differentiable renderer and directional CLIP losses. Additionally, during the inference process, we utilize a Mesh Agnostic Encoder (MAGE) that takes deformation target, a mesh of diverse topologies as input to the stylization process and encodes its shape into our latent space. The resulting stylized face model can be animated by commonly used 3DMM blend shapes. A set of quantitative and qualitative evaluations demonstrate that our method can produce highly stylized face meshes according to a given style and output them in a desired topology. We also demonstrate example applications of our method including image-based stylized avatar generation, linear interpolation of geometric styles, and facial animation of stylized avatars.
comment: 8 pages
☆ Anytime, Anywhere, Anyone: Investigating the Feasibility of Segment Anything Model for Crowd-Sourcing Medical Image Annotations
Curating annotations for medical image segmentation is a labor-intensive and time-consuming task that requires domain expertise, resulting in "narrowly" focused deep learning (DL) models with limited translational utility. Recently, foundation models like the Segment Anything Model (SAM) have revolutionized semantic segmentation with exceptional zero-shot generalizability across various domains, including medical imaging, and hold a lot of promise for streamlining the annotation process. However, SAM has yet to be evaluated in a crowd-sourced setting to curate annotations for training 3D DL segmentation models. In this work, we explore the potential of SAM for crowd-sourcing "sparse" annotations from non-experts to generate "dense" segmentation masks for training 3D nnU-Net models, a state-of-the-art DL segmentation model. Our results indicate that while SAM-generated annotations exhibit high mean Dice scores compared to ground-truth annotations, nnU-Net models trained on SAM-generated annotations perform significantly worse than nnU-Net models trained on ground-truth annotations ($p<0.001$, all).
☆ GCN-DevLSTM: Path Development for Skeleton-Based Action Recognition
Skeleton-based action recognition (SAR) in videos is an important but challenging task in computer vision. The recent state-of-the-art models for SAR are primarily based on graph convolutional neural networks (GCNs), which are powerful in extracting the spatial information of skeleton data. However, it is yet clear that such GCN-based models can effectively capture the temporal dynamics of human action sequences. To this end, we propose the DevLSTM module, which exploits the path development -- a principled and parsimonious representation for sequential data by leveraging the Lie group structure. The path development, originated from Rough path theory, can effectively capture the order of events in high-dimensional stream data with massive dimension reduction and consequently enhance the LSTM module substantially. Our proposed G-DevLSTM module can be conveniently plugged into the temporal graph, complementing existing advanced GCN-based models. Our empirical studies on the NTU60, NTU120 and Chalearn2013 datasets demonstrate that our proposed hybrid model significantly outperforms the current best-performing methods in SAR tasks. The code is available at https://github.com/DeepIntoStreams/GCN-DevLSTM.
☆ MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in obvious cases, especially due to the modality bias learned from statistically biased datasets. From these problems, we anticipate that maybe understanding the complementary information itself is difficult to achieve from vision-only models. Accordingly, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework, which incorporates Large Language Models (LLMs) to understand the complementary information at the semantic level and further enhance the fusion process. Specifically, we generate text descriptions of the pedestrian in each RGB and thermal modality and design a Multispectral Chain-of-Thought (MSCoT) prompting, which models a step-by-step process to facilitate cross-modal reasoning at the semantic level and perform accurate detection. Moreover, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing vision-driven and language-driven detections. Extensive experiments validate that MSCoTDet improves multispectral pedestrian detection.
☆ DITTO: Demonstration Imitation by Trajectory Transformation IROS 2024
Teaching robots new skills quickly and conveniently is crucial for the broader adoption of robotic systems. In this work, we address the problem of one-shot imitation from a single human demonstration, given by an RGB-D video recording through a two-stage process. In the first stage which is offline, we extract the trajectory of the demonstration. This entails segmenting manipulated objects and determining their relative motion in relation to secondary objects such as containers. Subsequently, in the live online trajectory generation stage, we first \mbox{re-detect} all objects, then we warp the demonstration trajectory to the current scene, and finally, we trace the trajectory with the robot. To complete these steps, our method makes leverages several ancillary models, including those for segmentation, relative object pose estimation, and grasp prediction. We systematically evaluate different combinations of correspondence and re-detection methods to validate our design decision across a diverse range of tasks. Specifically, we collect demonstrations of ten different tasks including pick-and-place tasks as well as articulated object manipulation. Finally, we perform extensive evaluations on a real robot system to demonstrate the effectiveness and utility of our approach in real-world scenarios. We make the code publicly available at http://ditto.cs.uni-freiburg.de.
comment: 8 pages, 4 figures, 3 tables, submitted to IROS 2024
☆ Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion
The landscape of deep learning research is moving towards innovative strategies to harness the true potential of data. Traditionally, emphasis has been on scaling model architectures, resulting in large and complex neural networks, which can be difficult to train with limited computational resources. However, independently of the model size, data quality (i.e. amount and variability) is still a major factor that affects model generalization. In this work, we propose a novel technique to exploit available data through the use of automatic data augmentation for the tasks of image classification and semantic segmentation. We introduce the first Differentiable Augmentation Search method (DAS) to generate variations of images that can be processed as videos. Compared to previous approaches, DAS is extremely fast and flexible, allowing the search on very large search spaces in less than a GPU day. Our intuition is that the increased receptive field in the temporal dimension provided by DAS could lead to benefits also to the spatial receptive field. More specifically, we leverage DAS to guide the reshaping of the spatial receptive field by selecting task-dependant transformations. As a result, compared to standard augmentation alternatives, we improve in terms of accuracy on ImageNet, Cifar10, Cifar100, Tiny-ImageNet, Pascal-VOC-2012 and CityScapes datasets when plugging-in our DAS over different light-weight video backbones.
☆ SFOD: Spiking Fusion Object Detector CVPR2024
Event cameras, characterized by high temporal resolution, high dynamic range, low power consumption, and high pixel bandwidth, offer unique capabilities for object detection in specialized contexts. Despite these advantages, the inherent sparsity and asynchrony of event data pose challenges to existing object detection algorithms. Spiking Neural Networks (SNNs), inspired by the way the human brain codes and processes information, offer a potential solution to these difficulties. However, their performance in object detection using event cameras is limited in current implementations. In this paper, we propose the Spiking Fusion Object Detector (SFOD), a simple and efficient approach to SNN-based object detection. Specifically, we design a Spiking Fusion Module, achieving the first-time fusion of feature maps from different scales in SNNs applied to event cameras. Additionally, through integrating our analysis and experiments conducted during the pretraining of the backbone network on the NCAR dataset, we delve deeply into the impact of spiking decoding strategies and loss functions on model performance. Thereby, we establish state-of-the-art classification results based on SNNs, achieving 93.7\% accuracy on the NCAR dataset. Experimental results on the GEN1 detection dataset demonstrate that the SFOD achieves a state-of-the-art mAP of 32.1\%, outperforming existing SNN-based approaches. Our research not only underscores the potential of SNNs in object detection with event cameras but also propels the advancement of SNNs. Code is available at https://github.com/yimeng-fan/SFOD.
comment: Accepted by CVPR2024
☆ PDE-CNNs: Axiomatic Derivations and Applications
PDE-based Group Convolutional Neural Networks (PDE-G-CNNs) utilize solvers of geometrically meaningful evolution PDEs as substitutes for the conventional components in G-CNNs. PDE-G-CNNs offer several key benefits all at once: fewer parameters, inherent equivariance, better performance, data efficiency, and geometric interpretability. In this article we focus on Euclidean equivariant PDE-G-CNNs where the feature maps are two dimensional throughout. We call this variant of the framework a PDE-CNN. We list several practically desirable axioms and derive from these which PDEs should be used in a PDE-CNN. Here our approach to geometric learning via PDEs is inspired by the axioms of classical linear and morphological scale-space theory, which we generalize by introducing semifield-valued signals. Furthermore, we experimentally confirm for small networks that PDE-CNNs offer fewer parameters, better performance, and data efficiency in comparison to CNNs. We also investigate what effect the use of different semifields has on the performance of the models.
☆ LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels CVPR 2024
Autonomous systems need to process large-scale, sparse, and irregular point clouds with limited compute resources. Consequently, it is essential to develop LiDAR perception methods that are both efficient and effective. Although naively enlarging 3D kernel size can enhance performance, it will also lead to a cubically-increasing overhead. Therefore, it is crucial to develop streamlined 3D large kernel designs that eliminate redundant weights and work effectively with larger kernels. In this paper, we propose an efficient and effective Large Sparse Kernel 3D Neural Network (LSK3DNet) that leverages dynamic pruning to amplify the 3D kernel size. Our method comprises two core components: Spatial-wise Dynamic Sparsity (SDS) and Channel-wise Weight Selection (CWS). SDS dynamically prunes and regrows volumetric weights from the beginning to learn a large sparse 3D kernel. It not only boosts performance but also significantly reduces model size and computational cost. Moreover, CWS selects the most important channels for 3D convolution during training and subsequently prunes the redundant channels to accelerate inference for 3D vision tasks. We demonstrate the effectiveness of LSK3DNet on three benchmark datasets and five tracks compared with classical models and large kernel designs. Notably, LSK3DNet achieves the state-of-the-art performance on SemanticKITTI (i.e., 75.6% on single-scan and 63.4% on multi-scan), with roughly 40% model size reduction and 60% computing operations reduction compared to the naive large 3D kernel model.
comment: Accepted at CVPR 2024; Project page: https://github.com/FengZicai/LSK3DNet
☆ FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos
Digitising the 3D world into a clean, CAD model-based representation has important applications for augmented reality and robotics. Current state-of-the-art methods are computationally intensive as they individually encode each detected object and optimise CAD alignments in a second stage. In this work, we propose FastCAD, a real-time method that simultaneously retrieves and aligns CAD models for all objects in a given scene. In contrast to previous works, we directly predict alignment parameters and shape embeddings. We achieve high-quality shape retrievals by learning CAD embeddings in a contrastive learning framework and distilling those into FastCAD. Our single-stage method accelerates the inference time by a factor of 50 compared to other methods operating on RGB-D scans while outperforming them on the challenging Scan2CAD alignment benchmark. Further, our approach collaborates seamlessly with online 3D reconstruction techniques. This enables the real-time generation of precise CAD model-based reconstructions from videos at 10 FPS. Doing so, we significantly improve the Scan2CAD alignment accuracy in the video setting from 43.0% to 48.2% and the reconstruction accuracy from 22.9% to 29.6%.
☆ Infrastructure-Assisted Collaborative Perception in Automated Valet Parking: A Safety Perspective
Environmental perception in Automated Valet Parking (AVP) has been a challenging task due to severe occlusions in parking garages. Although Collaborative Perception (CP) can be applied to broaden the field of view of connected vehicles, the limited bandwidth of vehicular communications restricts its application. In this work, we propose a BEV feature-based CP network architecture for infrastructure-assisted AVP systems. The model takes the roadside camera and LiDAR as optional inputs and adaptively fuses them with onboard sensors in a unified BEV representation. Autoencoder and downsampling are applied for channel-wise and spatial-wise dimension reduction, while sparsification and quantization further compress the feature map with little loss in data precision. Combining these techniques, the size of a BEV feature map is effectively compressed to fit in the feasible data rate of the NR-V2X network. With the synthetic AVP dataset, we observe that CP can effectively increase perception performance, especially for pedestrians. Moreover, the advantage of infrastructure-assisted CP is demonstrated in two typical safety-critical scenarios in the AVP setting, increasing the maximum safe cruising speed by up to 3m/s in both scenarios.
comment: 7 pages, 7 figures, 4 tables, accepted by IEEE VTC2024-Spring
☆ A Multimodal Approach for Cross-Domain Image Retrieval
Image generators are gaining vast amount of popularity and have rapidly changed how digital content is created. With the latest AI technology, millions of high quality images are being generated by the public, which are constantly motivating the research community to push the limits of generative models to create more complex and realistic images. This paper focuses on Cross-Domain Image Retrieval (CDIR) which can be used as an additional tool to inspect collections of generated images by determining the level of similarity between images in a dataset. An ideal retrieval system would be able to generalize to unseen complex images from multiple domains (e.g., photos, drawings and paintings). To address this goal, we propose a novel caption-matching approach that leverages multimodal language-vision architectures pre-trained on large datasets. The method is tested on DomainNet and Office-Home datasets and consistently achieves state-of-the-art performance over the latest approaches in the literature for cross-domain image retrieval. In order to verify the effectiveness with AI-generated images, the method was also put to test with a database composed by samples collected from Midjourney, which is a widely used generative platform for content creation.
☆ An In-Depth Analysis of Data Reduction Methods for Sustainable Deep Learning
In recent years, Deep Learning has gained popularity for its ability to solve complex classification tasks, increasingly delivering better results thanks to the development of more accurate models, the availability of huge volumes of data and the improved computational capabilities of modern computers. However, these improvements in performance also bring efficiency problems, related to the storage of datasets and models, and to the waste of energy and time involved in both the training and inference processes. In this context, data reduction can help reduce energy consumption when training a deep learning model. In this paper, we present up to eight different methods to reduce the size of a tabular training dataset, and we develop a Python package to apply them. We also introduce a representativeness metric based on topology to measure how similar are the reduced datasets and the full training dataset. Additionally, we develop a methodology to apply these data reduction methods to image datasets for object detection tasks. Finally, we experimentally compare how these data reduction methods affect the representativeness of the reduced dataset, the energy consumption and the predictive performance of the model.
☆ Modular Deep Active Learning Framework for Image Annotation: A Technical Report for the Ophthalmo-AI Project
Image annotation is one of the most essential tasks for guaranteeing proper treatment for patients and tracking progress over the course of therapy in the field of medical imaging and disease diagnosis. However, manually annotating a lot of 2D and 3D imaging data can be extremely tedious. Deep Learning (DL) based segmentation algorithms have completely transformed this process and made it possible to automate image segmentation. By accurately segmenting medical images, these algorithms can greatly minimize the time and effort necessary for manual annotation. Additionally, by incorporating Active Learning (AL) methods, these segmentation algorithms can perform far more effectively with a smaller amount of ground truth data. We introduce MedDeepCyleAL, an end-to-end framework implementing the complete AL cycle. It provides researchers with the flexibility to choose the type of deep learning model they wish to employ and includes an annotation tool that supports the classification and segmentation of medical images. The user-friendly interface allows for easy alteration of the AL and DL model settings through a configuration file, requiring no prior programming experience. While MedDeepCyleAL can be applied to any kind of image data, we have specifically applied it to ophthalmology data in this project.
comment: DFKI Technical Report
☆ Deep Generative Model based Rate-Distortion for Image Downscaling Assessment CVPR 2024
In this paper, we propose Image Downscaling Assessment by Rate-Distortion (IDA-RD), a novel measure to quantitatively evaluate image downscaling algorithms. In contrast to image-based methods that measure the quality of downscaled images, ours is process-based that draws ideas from rate-distortion theory to measure the distortion incurred during downscaling. Our main idea is that downscaling and super-resolution (SR) can be viewed as the encoding and decoding processes in the rate-distortion model, respectively, and that a downscaling algorithm that preserves more details in the resulting low-resolution (LR) images should lead to less distorted high-resolution (HR) images in SR. In other words, the distortion should increase as the downscaling algorithm deteriorates. However, it is non-trivial to measure this distortion as it requires the SR algorithm to be blind and stochastic. Our key insight is that such requirements can be met by recent SR algorithms based on deep generative models that can find all matching HR images for a given LR image on their learned image manifolds. Extensive experimental results show the effectiveness of our IDA-RD measure.
comment: Accepted at CVPR 2024
☆ Transfer CLIP for Generalizable Image Denoising CVPR2024
Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-world image recognition and segmentation. Yet, the potential for leveraging CLIP to enhance the robustness of low-level tasks remains largely unexplored. This paper uncovers that certain dense features extracted from the frozen ResNet image encoder of CLIP exhibit distortion-invariant and content-related properties, which are highly desirable for generalizable denoising. Leveraging these properties, we devise an asymmetrical encoder-decoder denoising network, which incorporates dense features including the noisy image and its multi-scale features from the frozen ResNet encoder of CLIP into a learnable image decoder to achieve generalizable denoising. The progressive feature augmentation strategy is further proposed to mitigate feature overfitting and improve the robustness of the learnable decoder. Extensive experiments and comparisons conducted across diverse OOD noises, including synthetic noise, real-world sRGB noise, and low-dose CT image noise, demonstrate the superior generalization ability of our method.
comment: Accepted by CVPR2024
☆ Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection ICCV2023
Current semi-supervised object detection (SSOD) algorithms typically assume class balanced datasets (PASCAL VOC etc.) or slightly class imbalanced datasets (MS-COCO, etc). This assumption can be easily violated since real world datasets can be extremely class imbalanced in nature, thus making the performance of semi-supervised object detectors far from satisfactory. Besides, the research for this problem in SSOD is severely under-explored. To bridge this research gap, we comprehensively study the class imbalance problem for SSOD under more challenging scenarios, thus forming the first experimental setting for class imbalanced SSOD (CI-SSOD). Moreover, we propose a simple yet effective gradient-based sampling framework that tackles the class imbalance problem from the perspective of two types of confirmation biases. To tackle confirmation bias towards majority classes, the gradient-based reweighting and gradient-based thresholding modules leverage the gradients from each class to fully balance the influence of the majority and minority classes. To tackle the confirmation bias from incorrect pseudo labels of minority classes, the class-rebalancing sampling module resamples unlabeled data following the guidance of the gradient-based reweighting module. Experiments on three proposed sub-tasks, namely MS-COCO, MS-COCO to Object365 and LVIS, suggest that our method outperforms current class imbalanced object detectors by clear margins, serving as a baseline for future research in CI-SSOD. Code will be available at https://github.com/nightkeepers/CI-SSOD.
comment: Accepted by ICCV2023
☆ EndoGSLAM: Real-Time Dense Reconstruction and Tracking in Endoscopic Surgeries using Gaussian Splatting
Precise camera tracking, high-fidelity 3D tissue reconstruction, and real-time online visualization are critical for intrabody medical imaging devices such as endoscopes and capsule robots. However, existing SLAM (Simultaneous Localization and Mapping) methods often struggle to achieve both complete high-quality surgical field reconstruction and efficient computation, restricting their intraoperative applications among endoscopic surgeries. In this paper, we introduce EndoGSLAM, an efficient SLAM approach for endoscopic surgeries, which integrates streamlined Gaussian representation and differentiable rasterization to facilitate over 100 fps rendering speed during online camera tracking and tissue reconstructing. Extensive experiments show that EndoGSLAM achieves a better trade-off between intraoperative availability and reconstruction quality than traditional or neural SLAM approaches, showing tremendous potential for endoscopic surgeries. The project page is at https://EndoGSLAM.loping151.com
☆ SYNCS: Synthetic Data and Contrastive Self-Supervised Training for Central Sulcus Segmentation
Bipolar disorder (BD) and schizophrenia (SZ) are severe mental disorders with profound societal impact. Identifying risk markers early is crucial for understanding disease progression and enabling preventive measures. The Danish High Risk and Resilience Study (VIA) focuses on understanding early disease processes, particularly in children with familial high risk (FHR). Understanding structural brain changes associated with these diseases during early stages is essential for effective interventions. The central sulcus (CS) is a prominent brain landmark related to brain regions involved in motor and sensory processing. Analyzing CS morphology can provide valuable insights into neurodevelopmental abnormalities in the FHR group. However, segmenting the central sulcus (CS) presents challenges due to its variability, especially in adolescents. This study introduces two novel approaches to improve CS segmentation: synthetic data generation to model CS variability and self-supervised pre-training with multi-task learning to adapt models to new cohorts. These methods aim to enhance segmentation performance across diverse populations, eliminating the need for extensive preprocessing.
☆ An Open-World, Diverse, Cross-Spatial-Temporal Benchmark for Dynamic Wild Person Re-Identification
Person re-identification (ReID) has made great strides thanks to the data-driven deep learning techniques. However, the existing benchmark datasets lack diversity, and models trained on these data cannot generalize well to dynamic wild scenarios. To meet the goal of improving the explicit generalization of ReID models, we develop a new Open-World, Diverse, Cross-Spatial-Temporal dataset named OWD with several distinct features. 1) Diverse collection scenes: multiple independent open-world and highly dynamic collecting scenes, including streets, intersections, shopping malls, etc. 2) Diverse lighting variations: long time spans from daytime to nighttime with abundant illumination changes. 3) Diverse person status: multiple camera networks in all seasons with normal/adverse weather conditions and diverse pedestrian appearances (e.g., clothes, personal belongings, poses, etc.). 4) Protected privacy: invisible faces for privacy critical applications. To improve the implicit generalization of ReID, we further propose a Latent Domain Expansion (LDE) method to develop the potential of source data, which decouples discriminative identity-relevant and trustworthy domain-relevant features and implicitly enforces domain-randomized identity feature space expansion with richer domain diversity to facilitate domain invariant representations. Our comprehensive evaluations with most benchmark datasets in the community are crucial for progress, although this work is far from the grand goal toward open-world and dynamic wild applications.
comment: Accepted by IJCV in 2024
☆ PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation IROS2024
Humans seemingly incorporate potential touch signals in their perception. Our goal is to equip robots with a similar capability, which we term \ourmodel. \ourmodel aims to predict the expected touch signal based on a visual patch representing the touched area. We frame this problem as the task of learning a low-dimensional visual-tactile embedding, wherein we encode a depth patch from which we decode the tactile signal. To accomplish this task, we employ ReSkin, an inexpensive and replaceable magnetic-based tactile sensor. Using ReSkin, we collect and train PseudoTouch on a dataset comprising aligned tactile and visual data pairs obtained through random touching of eight basic geometric shapes. We demonstrate the efficacy of PseudoTouch through its application to two downstream tasks: object recognition and grasp stability prediction. In the object recognition task, we evaluate the learned embedding's performance on a set of five basic geometric shapes and five household objects. Using PseudoTouch, we achieve an object recognition accuracy 84% after just ten touches, surpassing a proprioception baseline. For the grasp stability task, we use ACRONYM labels to train and evaluate a grasp success predictor using PseudoTouch's predictions derived from virtual depth information. Our approach yields an impressive 32% absolute improvement in accuracy compared to the baseline relying on partial point cloud data. We make the data, code, and trained models publicly available at http://pseudotouch.cs.uni-freiburg.de.
comment: 8 pages, 7 figures, 2 tables, submitted to IROS2024
☆ Improving cross-domain brain tissue segmentation in fetal MRI with synthetic data
Segmentation of fetal brain tissue from magnetic resonance imaging (MRI) plays a crucial role in the study of in utero neurodevelopment. However, automated tools face substantial domain shift challenges as they must be robust to highly heterogeneous clinical data, often limited in numbers and lacking annotations. Indeed, high variability of the fetal brain morphology, MRI acquisition parameters, and superresolution reconstruction (SR) algorithms adversely affect the model's performance when evaluated out-of-domain. In this work, we introduce FetalSynthSeg, a domain randomization method to segment fetal brain MRI, inspired by SynthSeg. Our results show that models trained solely on synthetic data outperform models trained on real data in out-ofdomain settings, validated on a 120-subject cross-domain dataset. Furthermore, we extend our evaluation to 40 subjects acquired using lowfield (0.55T) MRI and reconstructed with novel SR models, showcasing robustness across different magnetic field strengths and SR algorithms. Leveraging a generative synthetic approach, we tackle the domain shift problem in fetal brain MRI and offer compelling prospects for applications in fields with limited and highly heterogeneous data.
comment: 10 pages, 5 figures, 1 table
☆ UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction
Vehicle trajectory prediction has increasingly relied on data-driven solutions, but their ability to scale to different data domains and the impact of larger dataset sizes on their generalization remain under-explored. While these questions can be studied by employing multiple datasets, it is challenging due to several discrepancies, \textit{e.g.,} in data formats, map resolution, and semantic annotation types. To address these challenges, we introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria, presenting new opportunities for the vehicle trajectory prediction field. In particular, using UniTraj, we conduct extensive experiments and find that model performance significantly drops when transferred to other datasets. However, enlarging data size and diversity can substantially improve performance, leading to a new state-of-the-art result for the nuScenes dataset. We provide insights into dataset characteristics to explain these findings. The code can be found here: \hyperlink{https://github.com/vita-epfl/UniTraj}{https://github.com/vita-epfl/UniTraj}.
☆ IFSENet : Harnessing Sparse Iterations for Interactive Few-shot Segmentation Excellence
Training a computer vision system to segment a novel class typically requires collecting and painstakingly annotating lots of images with objects from that class. Few-shot segmentation techniques reduce the required number of images to learn to segment a new class, but careful annotations of object boundaries are still required. On the other hand, interactive segmentation techniques only focus on incrementally improving the segmentation of one object at a time (typically, using clicks given by an expert) in a class-agnostic manner. We combine the two concepts to drastically reduce the effort required to train segmentation models for novel classes. Instead of trivially feeding interactive segmentation masks as ground truth to a few-shot segmentation model, we propose IFSENet, which can accept sparse supervision on a single or few support images in the form of clicks to generate masks on support (training, at least clicked upon once) as well as query (test, never clicked upon) images. To trade-off effort for accuracy flexibly, the number of images and clicks can be incrementally added to the support set to further improve the segmentation of support as well as query images. The proposed model approaches the accuracy of previous state-of-the-art few-shot segmentation models with considerably lower annotation effort (clicks instead of maps), when tested on Pascal and SBD datasets on query images. It also works well as an interactive segmentation method on support images.
☆ Cell Variational Information Bottleneck Network
In this work, we propose Cell Variational Information Bottleneck Network (cellVIB), a convolutional neural network using information bottleneck mechanism, which can be combined with the latest feedforward network architecture in an end-to-end training method. Our Cell Variational Information Bottleneck Network is constructed by stacking VIB cells, which generate feature maps with uncertainty. As layers going deeper, the regularization effect will gradually increase, instead of directly adding excessive regular constraints to the output layer of the model as in Deep VIB. Under each VIB cell, the feedforward process learns an independent mean term and an standard deviation term, and predicts the Gaussian distribution based on them. The feedback process is based on reparameterization trick for effective training. This work performs an extensive analysis on MNIST dataset to verify the effectiveness of each VIB cells, and provides an insightful analysis on how the VIB cells affect mutual information. Experiments conducted on CIFAR-10 also prove that our cellVIB is robust against noisy labels during training and against corrupted images during testing. Then, we validate our method on PACS dataset, whose results show that the VIB cells can significantly improve the generalization performance of the basic model. Finally, in a more complex representation learning task, face recognition, our network structure has also achieved very competitive results.
☆ Integrating multiscale topology in digital pathology with pyramidal graph convolutional networks
Graph convolutional networks (GCNs) have emerged as a powerful alternative to multiple instance learning with convolutional neural networks in digital pathology, offering superior handling of structural information across various spatial ranges - a crucial aspect of learning from gigapixel H&E-stained whole slide images (WSI). However, graph message-passing algorithms often suffer from oversmoothing when aggregating a large neighborhood. Hence, effective modeling of multi-range interactions relies on the careful construction of the graph. Our proposed multi-scale GCN (MS-GCN) tackles this issue by leveraging information across multiple magnification levels in WSIs. MS-GCN enables the simultaneous modeling of long-range structural dependencies at lower magnifications and high-resolution cellular details at higher magnifications, akin to analysis pipelines usually conducted by pathologists. The architecture's unique configuration allows for the concurrent modeling of structural patterns at lower magnifications and detailed cellular features at higher ones, while also quantifying the contribution of each magnification level to the prediction. Through testing on different datasets, MS-GCN demonstrates superior performance over existing single-magnification GCN methods. The enhancement in performance and interpretability afforded by our method holds promise for advancing computational pathology models, especially in tasks requiring extensive spatial context.
☆ Recent Trends in 3D Reconstruction of General Non-Rigid Scenes
Reconstructing models of the real world, including 3D geometry, appearance, and motion of real scenes, is essential for computer graphics and computer vision. It enables the synthesizing of photorealistic novel views, useful for the movie industry and AR/VR applications. It also facilitates the content creation necessary in computer games and AR/VR by avoiding laborious manual design processes. Further, such models are fundamental for intelligent computing systems that need to interpret real-world scenes and actions to act and interact safely with the human world. Notably, the world surrounding us is dynamic, and reconstructing models of dynamic, non-rigidly moving scenes is a severely underconstrained and challenging problem. This state-of-the-art report (STAR) offers the reader a comprehensive summary of state-of-the-art techniques with monocular and multi-view inputs such as data from RGB and RGB-D sensors, among others, conveying an understanding of different approaches, their potential applications, and promising further research directions. The report covers 3D reconstruction of general non-rigid scenes and further addresses the techniques for scene decomposition, editing and controlling, and generalizable and generative modeling. More specifically, we first review the common and fundamental concepts necessary to understand and navigate the field and then discuss the state-of-the-art techniques by reviewing recent approaches that use traditional and machine-learning-based neural representations, including a discussion on the newly enabled applications. The STAR is concluded with a discussion of the remaining limitations and open challenges.
comment: 42 pages, 18 figures, 5 tables; State-of-the-Art Report at EUROGRAPHICS 2024
☆ Towards a Comprehensive, Efficient and Promptable Anatomic Structure Segmentation Model using 3D Whole-body CT Scans
Segment anything model (SAM) demonstrates strong generalization ability on natural image segmentation. However, its direct adaption in medical image segmentation tasks shows significant performance drops with inferior accuracy and unstable results. It may also requires an excessive number of prompt points to obtain a reasonable accuracy. For segmenting 3D radiological CT or MRI scans, a 2D SAM model has to separately handle hundreds of 2D slices. Although quite a few studies explore adapting SAM into medical image volumes, the efficiency of 2D adaption methods is unsatisfactory and 3D adaptation methods only capable of segmenting specific organs/tumors. In this work, we propose a comprehensive and scalable 3D SAM model for whole-body CT segmentation, named CT-SAM3D. Instead of adapting SAM, we propose a 3D promptable segmentation model using a (nearly) fully labeled CT dataset. To train CT-SAM3D effectively, ensuring the model's accurate responses to higher-dimensional spatial prompts is crucial, and 3D patch-wise training is required due to GPU memory constraints. For this purpose, we propose two key technical developments: 1) a progressively and spatially aligned prompt encoding method to effectively encode click prompts in local 3D space; and 2) a cross-patch prompt learning scheme to capture more 3D spatial context, which is beneficial for reducing the editing workloads when interactively prompting on large organs. CT-SAM3D is trained and validated using a curated dataset of 1204 CT scans containing 107 whole-body anatomies, reporting significantly better quantitative performance against all previous SAM-derived models by a large margin with much fewer click prompts. Our model can handle segmenting unseen organ as well. Code, data, and our 3D interactive segmentation tool with quasi-real-time responses will be made publicly available.
☆ Subjective Quality Assessment of Compressed Tone-Mapped High Dynamic Range Videos
High Dynamic Range (HDR) videos are able to represent wider ranges of contrasts and colors than Standard Dynamic Range (SDR) videos, giving more vivid experiences. Due to this, HDR videos are expected to grow into the dominant video modality of the future. However, HDR videos are incompatible with existing SDR displays, which form the majority of affordable consumer displays on the market. Because of this, HDR videos must be processed by tone-mapping them to reduced bit-depths to service a broad swath of SDR-limited video consumers. Here, we analyze the impact of tone-mapping operators on the visual quality of streaming HDR videos. To this end, we built the first large-scale subjectively annotated open-source database of compressed tone-mapped HDR videos, containing 15,000 tone-mapped sequences derived from 40 unique HDR source contents. The videos in the database were labeled with more than 750,000 subjective quality annotations, collected from more than 1,600 unique human observers. We demonstrate the usefulness of the new subjective database by benchmarking objective models of visual quality on it. We envision that the new LIVE Tone-Mapped HDR (LIVE-TMHDR) database will enable significant progress on HDR video tone mapping and quality assessment in the future. To this end, we make the database freely available to the community at https://live.ece.utexas.edu/research/LIVE_TMHDR/index.html
☆ MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration
Recent advances in tuning-free personalized image generation based on diffusion models are impressive. However, to improve subject fidelity, existing methods either retrain the diffusion model or infuse it with dense visual embeddings, both of which suffer from poor generalization and efficiency. Also, these methods falter in multi-subject image generation due to the unconstrained cross-attention mechanism. In this paper, we propose MM-Diff, a unified and tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds. Specifically, to simultaneously enhance text consistency and subject fidelity, MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings. CLS embeddings are used on the one hand to augment the text embeddings, and on the other hand together with patch embeddings to derive a small number of detail-rich subject embeddings, both of which are efficiently integrated into the diffusion model through the well-designed multimodal cross-attention mechanism. Additionally, MM-Diff introduces cross-attention map constraints during the training phase, ensuring flexible multi-subject image sampling during inference without any predefined inputs (e.g., layout). Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods.
☆ Continual Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and the visual information they observe. Existing methods for training VLN agents presuppose fixed datasets, leading to a significant limitation: the introduction of new environments necessitates retraining with previously encountered environments to preserve their knowledge. This makes it difficult to train VLN agents that operate in the ever-changing real world. To address this limitation, we present the Continual Vision-and-Language Navigation (CVLN) paradigm, designed to evaluate agents trained through a continual learning process. For the training and evaluation of CVLN agents, we re-arrange existing VLN datasets to propose two datasets: CVLN-I, focused on navigation via initial-instruction interpretation, and CVLN-D, aimed at navigation through dialogue with other agents. Furthermore, we propose two novel rehearsal-based methods for CVLN, Perplexity Replay (PerpR) and Episodic Self-Replay (ESR). PerpR prioritizes replaying challenging episodes based on action perplexity, while ESR replays previously predicted action logits to preserve learned behaviors. We demonstrate the effectiveness of the proposed methods on CVLN through extensive experiments.
☆ Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning
Large-scale Text-to-Image (TTI) models have become a common approach for generating training data in various generative fields. However, visual hallucinations, which contain perceptually critical defects, remain a concern, especially in non-photorealistic styles like cartoon characters. We propose a novel visual hallucination detection system for cartoon character images generated by TTI models. Our approach leverages pose-aware in-context visual learning (PA-ICVL) with Vision-Language Models (VLMs), utilizing both RGB images and pose information. By incorporating pose guidance from a fine-tuned pose estimator, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. This research advances TTI models by mitigating visual hallucinations, expanding their potential in non-photorealistic domains.
comment: 11 pages, 12 figures, 1 table, Project page: https://gh-bumsookim.github.io/Cartoon-Hallucinations-Detection/
☆ Multimodal Fusion with Pre-Trained Model Features in Affective Behaviour Analysis In-the-wild
Multimodal fusion is a significant method for most multimodal tasks. With the recent surge in the number of large pre-trained models, combining both multimodal fusion methods and pre-trained model features can achieve outstanding performance in many multimodal tasks. In this paper, we present our approach, which leverages both advantages for addressing the task of Expression (Expr) Recognition and Valence-Arousal (VA) Estimation. We evaluate the Aff-Wild2 database using pre-trained models, then extract the final hidden layers of the models as features. Following preprocessing and interpolation or convolution to align the extracted features, different models are employed for modal fusion. Our code is available at GitHub - FulgenceWen/ABAW6th.
☆ Toward Tiny and High-quality Facial Makeup with Data Amplify Learning
Contemporary makeup approaches primarily hinge on unpaired learning paradigms, yet they grapple with the challenges of inaccurate supervision (e.g., face misalignment) and sophisticated facial prompts (including face parsing, and landmark detection). These challenges prohibit low-cost deployment of facial makeup models, especially on mobile devices. To solve above problems, we propose a brand-new learning paradigm, termed "Data Amplify Learning (DAL)," alongside a compact makeup model named "TinyBeauty." The core idea of DAL lies in employing a Diffusion-based Data Amplifier (DDA) to "amplify" limited images for the model training, thereby enabling accurate pixel-to-pixel supervision with merely a handful of annotations. Two pivotal innovations in DDA facilitate the above training approach: (1) A Residual Diffusion Model (RDM) is designed to generate high-fidelity detail and circumvent the detail vanishing problem in the vanilla diffusion models; (2) A Fine-Grained Makeup Module (FGMM) is proposed to achieve precise makeup control and combination while retaining face identity. Coupled with DAL, TinyBeauty necessitates merely 80K parameters to achieve a state-of-the-art performance without intricate face prompts. Meanwhile, TinyBeauty achieves a remarkable inference speed of up to 460 fps on the iPhone 13. Extensive experiments show that DAL can produce highly competitive makeup models using only 5 image pairs.
☆ An Integrated Neighborhood and Scale Information Network for Open-Pit Mine Change Detection in High-Resolution Remote Sensing Images
Open-pit mine change detection (CD) in high-resolution (HR) remote sensing images plays a crucial role in mineral development and environmental protection. Significant progress has been made in this field in recent years, largely due to the advancement of deep learning techniques. However, existing deep-learning-based CD methods encounter challenges in effectively integrating neighborhood and scale information, resulting in suboptimal performance. Therefore, by exploring the influence patterns of neighborhood and scale information, this paper proposes an Integrated Neighborhood and Scale Information Network (INSINet) for open-pit mine CD in HR remote sensing images. Specifically, INSINet introduces 8-neighborhood-image information to acquire a larger receptive field, improving the recognition of center image boundary regions. Drawing on techniques of skip connection, deep supervision, and attention mechanism, the multi-path deep supervised attention (MDSA) module is designed to enhance multi-scale information fusion and change feature extraction. Experimental analysis reveals that incorporating neighborhood and scale information enhances the F1 score of INSINet by 6.40%, with improvements of 3.08% and 3.32% respectively. INSINet outperforms existing methods with an Overall Accuracy of 97.69%, Intersection over Union of 71.26%, and F1 score of 83.22%. INSINet shows significance for open-pit mine CD in HR remote sensing images.
☆ Image Classification with Rotation-Invariant Variational Quantum Circuits
Variational quantum algorithms are gaining attention as an early application of Noisy Intermediate-Scale Quantum (NISQ) devices. One of the main problems of variational methods lies in the phenomenon of Barren Plateaus, present in the optimization of variational parameters. Adding geometric inductive bias to the quantum models has been proposed as a potential solution to mitigate this problem, leading to a new field called Geometric Quantum Machine Learning. In this work, an equivariant architecture for variational quantum classifiers is introduced to create a label-invariant model for image classification with $C_4$ rotational label symmetry. The equivariant circuit is benchmarked against two different architectures, and it is experimentally observed that the geometric approach boosts the model's performance. Finally, a classical equivariant convolution operation is proposed to extend the quantum model for the processing of larger images, employing the resources available in NISQ devices.
comment: 9 pages, 9 figures
☆ VRSO: Visual-Centric Reconstruction for Static Object Annotation
As a part of the perception results of intelligent driving systems, static object detection (SOD) in 3D space provides crucial cues for driving environment understanding. With the rapid deployment of deep neural networks for SOD tasks, the demand for high-quality training samples soars. The traditional, also reliable, way is manual labeling over the dense LiDAR point clouds and reference images. Though most public driving datasets adopt this strategy to provide SOD ground truth (GT), it is still expensive (requires LiDAR scanners) and low-efficient (time-consuming and unscalable) in practice. This paper introduces VRSO, a visual-centric approach for static object annotation. VRSO is distinguished in low cost, high efficiency, and high quality: (1) It recovers static objects in 3D space with only camera images as input, and (2) manual labeling is barely involved since GT for SOD tasks is generated based on an automatic reconstruction and annotation pipeline. (3) Experiments on the Waymo Open Dataset show that the mean reprojection error from VRSO annotation is only 2.6 pixels, around four times lower than the Waymo labeling (10.6 pixels). Source code is available at: https://github.com/CaiYingFeng/VRSO.
comment: submitted to iros 2024
☆ BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation
3D instance segmentation (3DIS) is a crucial task, but point-level annotations are tedious in fully supervised settings. Thus, using bounding boxes (bboxes) as annotations has shown great potential. The current mainstream approach is a two-step process, involving the generation of pseudo-labels from box annotations and the training of a 3DIS network with the pseudo-labels. However, due to the presence of intersections among bboxes, not every point has a determined instance label, especially in overlapping areas. To generate higher quality pseudo-labels and achieve more precise weakly supervised 3DIS results, we propose the Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation (BSNet), which devises a novel pseudo-labeler called Simulation-assisted Transformer. The labeler consists of two main components. The first is Simulation-assisted Mean Teacher, which introduces Mean Teacher for the first time in this task and constructs simulated samples to assist the labeler in acquiring prior knowledge about overlapping areas. To better model local-global structure, we also propose Local-Global Aware Attention as the decoder for teacher and student labelers. Extensive experiments conducted on the ScanNetV2 and S3DIS datasets verify the superiority of our designs. Code is available at \href{https://github.com/peoplelu/BSNet}{https://github.com/peoplelu/BSNet}.
☆ Vehicle Detection Performance in Nordic Region ICPR2024
This paper addresses the critical challenge of vehicle detection in the harsh winter conditions in the Nordic regions, characterized by heavy snowfall, reduced visibility, and low lighting. Due to their susceptibility to environmental distortions and occlusions, traditional vehicle detection methods have struggled in these adverse conditions. The advanced proposed deep learning architectures brought promise, yet the unique difficulties of detecting vehicles in Nordic winters remain inadequately addressed. This study uses the Nordic Vehicle Dataset (NVD), which has UAV images from northern Sweden, to evaluate the performance of state-of-the-art vehicle detection algorithms under challenging weather conditions. Our methodology includes a comprehensive evaluation of single-stage, two-stage, and transformer-based detectors against the NVD. We propose a series of enhancements tailored to each detection framework, including data augmentation, hyperparameter tuning, transfer learning, and novel strategies designed explicitly for the DETR model. Our findings not only highlight the limitations of current detection systems in the Nordic environment but also offer promising directions for enhancing these algorithms for improved robustness and accuracy in vehicle detection amidst the complexities of winter landscapes. The code and the dataset are available at https://nvd.ltu-ai.dev
comment: submitted to ICPR2024
☆ Extracting Human Attention through Crowdsourced Patch Labeling
In image classification, a significant problem arises from bias in the datasets. When it contains only specific types of images, the classifier begins to rely on shortcuts - simplistic and erroneous rules for decision-making. This leads to high performance on the training dataset but inferior results on new, varied images, as the classifier's generalization capability is reduced. For example, if the images labeled as mustache consist solely of male figures, the model may inadvertently learn to classify images by gender rather than the presence of a mustache. One approach to mitigate such biases is to direct the model's attention toward the target object's location, usually marked using bounding boxes or polygons for annotation. However, collecting such annotations requires substantial time and human effort. Therefore, we propose a novel patch-labeling method that integrates AI assistance with crowdsourcing to capture human attention from images, which can be a viable solution for mitigating bias. Our method consists of two steps. First, we extract the approximate location of a target using a pre-trained saliency detection model supplemented by human verification for accuracy. Then, we determine the human-attentive area in the image by iteratively dividing the image into smaller patches and employing crowdsourcing to ascertain whether each patch can be classified as the target object. We demonstrated the effectiveness of our method in mitigating bias through improved classification accuracy and the refined focus of the model. Also, crowdsourced experiments validate that our method collects human annotation up to 3.4 times faster than annotating object locations with polygons, significantly reducing the need for human resources. We conclude the paper by discussing the advantages of our method in a crowdsourcing context, mainly focusing on aspects of human errors and accessibility.
comment: 21 pages, 11 figures
☆ Cell Tracking according to Biological Needs -- Strong Mitosis-aware Random-finite Sets Tracker with Aleatoric Uncertainty
Cell tracking and segmentation assist biologists in extracting insights from large-scale microscopy time-lapse data. Driven by local accuracy metrics, current tracking approaches often suffer from a lack of long-term consistency. To address this issue, we introduce an uncertainty estimation technique for neural tracking-by-regression frameworks and incorporate it into our novel extended Poisson multi-Bernoulli mixture tracker. Our uncertainty estimation identifies uncertain associations within high-performing tracking-by-regression methods using problem-specific test-time augmentations. Leveraging this uncertainty, along with a novel mitosis-aware assignment problem formulation, our tracker resolves false associations and mitosis detections stemming from long-term conflicts. We evaluate our approach on nine competitive datasets and demonstrate that it outperforms the current state-of-the-art on biologically relevant metrics substantially, achieving improvements by a factor of approximately $5.75$. Furthermore, we uncover new insights into the behavior of tracking-by-regression uncertainty.
comment: 23 pages, 10 figures, 5 tables
☆ Clean-image Backdoor Attacks
To gather a significant quantity of annotated training data for high-performance image classification models, numerous companies opt to enlist third-party providers to label their unlabeled data. This practice is widely regarded as secure, even in cases where some annotated errors occur, as the impact of these minor inaccuracies on the final performance of the models is negligible and existing backdoor attacks require attacker's ability to poison the training images. Nevertheless, in this paper, we propose clean-image backdoor attacks which uncover that backdoors can still be injected via a fraction of incorrect labels without modifying the training images. Specifically, in our attacks, the attacker first seeks a trigger feature to divide the training images into two parts: those with the feature and those without it. Subsequently, the attacker falsifies the labels of the former part to a backdoor class. The backdoor will be finally implanted into the target model after it is trained on the poisoned data. During the inference phase, the attacker can activate the backdoor in two ways: slightly modifying the input image to obtain the trigger feature, or taking an image that naturally has the trigger feature as input. We conduct extensive experiments to demonstrate the effectiveness and practicality of our attacks. According to the experimental results, we conclude that our attacks seriously jeopardize the fairness and robustness of image classification models, and it is necessary to be vigilant about the incorrect labels in outsourced labeling.
☆ TexRO: Generating Delicate Textures of 3D Models by Recursive Optimization
This paper presents TexRO, a novel method for generating delicate textures of a known 3D mesh by optimizing its UV texture. The key contributions are two-fold. We propose an optimal viewpoint selection strategy, that finds the most miniature set of viewpoints covering all the faces of a mesh. Our viewpoint selection strategy guarantees the completeness of a generated result. We propose a recursive optimization pipeline that optimizes a UV texture at increasing resolutions, with an adaptive denoising method that re-uses existing textures for new texture generation. Through extensive experimentation, we demonstrate the superior performance of TexRO in terms of texture quality, detail preservation, visual consistency, and, notably runtime speed, outperforming other current methods. The broad applicability of TexRO is further confirmed through its successful use on diverse 3D models.
comment: Technical report. Project page: \href{https://3d-aigc.github.io/TexRO}{https://3d-aigc.github.io/TexRO}
☆ Tri-Perspective View Decomposition for Geometry-Aware Depth Completion CVPR 2024
Depth completion is a vital task for autonomous driving, as it involves reconstructing the precise 3D geometry of a scene from sparse and noisy depth measurements. However, most existing methods either rely only on 2D depth representations or directly incorporate raw 3D point clouds for compensation, which are still insufficient to capture the fine-grained 3D geometry of the scene. To address this challenge, we introduce Tri-Perspective view Decomposition (TPVD), a novel framework that can explicitly model 3D geometry. In particular, (1) TPVD ingeniously decomposes the original point cloud into three 2D views, one of which corresponds to the sparse depth input. (2) We design TPV Fusion to update the 2D TPV features through recurrent 2D-3D-2D aggregation, where a Distance-Aware Spherical Convolution (DASC) is applied. (3) By adaptively choosing TPV affinitive neighbors, the newly proposed Geometric Spatial Propagation Network (GSPN) further improves the geometric consistency. As a result, our TPVD outperforms existing methods on KITTI, NYUv2, and SUN RGBD. Furthermore, we build a novel depth completion dataset named TOFDC, which is acquired by the time-of-flight (TOF) sensor and the color camera on smartphones. Project page: https://yanzq95.github.io/projectpage/TOFDC/index.html
comment: Accepted to CVPR 2024
☆ ParFormer: Vision Transformer Baseline with Parallel Local Global Token Mixer and Convolution Attention Patch Embedding
This work presents ParFormer as an enhanced transformer architecture that allows the incorporation of different token mixers into a single stage, hence improving feature extraction capabilities. Integrating both local and global data allows for precise representation of short- and long-range spatial relationships without the need for computationally intensive methods such as shifting windows. Along with the parallel token mixer encoder, We offer the Convolutional Attention Patch Embedding (CAPE) as an enhancement of standard patch embedding to improve token mixer extraction with a convolutional attention module. Our comprehensive evaluation demonstrates that our ParFormer outperforms CNN-based and state-of-the-art transformer-based architectures in image classification and several complex tasks such as object recognition. The proposed CAPE has been demonstrated to benefit the overall MetaFormer architecture, even while utilizing the Identity Mapping Token Mixer, resulting in a 0.5\% increase in accuracy. The ParFormer models outperformed ConvNeXt and Swin Transformer for the pure convolution and transformer model in accuracy. Furthermore, our model surpasses the current leading hybrid transformer by reaching competitive Top-1 scores in the ImageNet-1K classification test. Specifically, our model variants with 11M, 23M, and 34M parameters achieve scores of 80.4\%, 82.1\%, and 83.1\%, respectively. Code: https://github.com/novendrastywn/ParFormer-CAPE-2024
☆ Magic for the Age of Quantized DNNs
Recently, the number of parameters in DNNs has explosively increased, as exemplified by LLMs (Large Language Models), making inference on small-scale computers more difficult. Model compression technology is, therefore, essential for integration into products. In this paper, we propose a method of quantization-aware training. We introduce a novel normalization (Layer-Batch Normalization) that is independent of the mini-batch size and does not require any additional computation cost during inference. Then, we quantize the weights by the scaled round-clip function with the weight standardization. We also quantize activation functions using the same function and apply surrogate gradients to train the model with both quantized weights and the quantized activation functions. We call this method Magic for the age of Quantised DNNs (MaQD). Experimental results show that our quantization method can be achieved with minimal accuracy degradation.
comment: 14 pages, 5 figures, 4 tables
☆ Improve Cross-domain Mixed Sampling with Guidance Training for Adaptive Segmentation
Unsupervised Domain Adaptation (UDA) endeavors to adjust models trained on a source domain to perform well on a target domain without requiring additional annotations. In the context of domain adaptive semantic segmentation, which tackles UDA for dense prediction, the goal is to circumvent the need for costly pixel-level annotations. Typically, various prevailing methods baseline rely on constructing intermediate domains via cross-domain mixed sampling techniques to mitigate the performance decline caused by domain gaps. However, such approaches generate synthetic data that diverge from real-world distributions, potentially leading the model astray from the true target distribution. To address this challenge, we propose a novel auxiliary task called Guidance Training. This task facilitates the effective utilization of cross-domain mixed sampling techniques while mitigating distribution shifts from the real world. Specifically, Guidance Training guides the model to extract and reconstruct the target-domain feature distribution from mixed data, followed by decoding the reconstructed target-domain features to make pseudo-label predictions. Importantly, integrating Guidance Training incurs minimal training overhead and imposes no additional inference burden. We demonstrate the efficacy of our approach by integrating it with existing methods, consistently improving performance. The implementation will be available at https://github.com/Wenlve-Zhou/Guidance-Training.
☆ Generative Active Learning for Image Synthesis Personalization
This paper presents a pilot study that explores the application of active learning, traditionally studied in the context of discriminative models, to generative models. We specifically focus on image synthesis personalization tasks. The primary challenge in conducting active learning on generative models lies in the open-ended nature of querying, which differs from the closed form of querying in discriminative models that typically target a single concept. We introduce the concept of anchor directions to transform the querying process into a semi-open problem. We propose a direction-based uncertainty sampling strategy to enable generative active learning and tackle the exploitation-exploration dilemma. Extensive experiments are conducted to validate the effectiveness of our approach, demonstrating that an open-source model can achieve superior performance compared to closed-source models developed by large companies, such as Google's StyleDrop. The source code is available at https://github.com/zhangxulu1996/GAL4Personalization.
☆ Piecewise-Linear Manifolds for Deep Metric Learning
Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point. These neighborhoods are used to estimate similarity between data points. We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques. We also show that proxies, commonly used in supervised metric learning, can be used to model the piecewise-linear manifold in an unsupervised setting, helping improve performance. Our method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks.
comment: Accepted at CPAL 2024 (Oral)
☆ AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies
With the continuous improvements of deepfake methods, forgery messages have transitioned from single-modality to multi-modal fusion, posing new challenges for existing forgery detection algorithms. In this paper, we propose AVT2-DWF, the Audio-Visual dual Transformers grounded in Dynamic Weight Fusion, which aims to amplify both intra- and cross-modal forgery cues, thereby enhancing detection capabilities. AVT2-DWF adopts a dual-stage approach to capture both spatial characteristics and temporal dynamics of facial expressions. This is achieved through a face transformer with an n-frame-wise tokenization strategy encoder and an audio transformer encoder. Subsequently, it uses multi-modal conversion with dynamic weight fusion to address the challenge of heterogeneous information fusion between audio and visual modalities. Experiments on DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets indicate that AVT2-DWF achieves state-of-the-art performance intra- and cross-dataset Deepfake detection. Code is available at https://github.com/raining-dev/AVT2-DWF.
☆ Trajectory Regularization Enhances Self-Supervised Geometric Representation
Self-supervised learning (SSL) has proven effective in learning high-quality representations for various downstream tasks, with a primary focus on semantic tasks. However, its application in geometric tasks remains underexplored, partially due to the absence of a standardized evaluation method for geometric representations. To address this gap, we introduce a new pose-estimation benchmark for assessing SSL geometric representations, which demands training without semantic or pose labels and achieving proficiency in both semantic and geometric downstream tasks. On this benchmark, we study enhancing SSL geometric representations without sacrificing semantic classification accuracy. We find that leveraging mid-layer representations improves pose-estimation performance by 10-20%. Further, we introduce an unsupervised trajectory-regularization loss, which improves performance by an additional 4% and improves generalization ability on out-of-distribution data. We hope the proposed benchmark and methods offer new insights and improvements in self-supervised geometric representation learning.
☆ DreamFlow: High-Quality Text-to-3D Generation by Approximating Probability Flow ICLR 2024
Recent progress in text-to-3D generation has been achieved through the utilization of score distillation methods: they make use of the pre-trained text-to-image (T2I) diffusion models by distilling via the diffusion model training objective. However, such an approach inevitably results in the use of random timesteps at each update, which increases the variance of the gradient and ultimately prolongs the optimization process. In this paper, we propose to enhance the text-to-3D optimization by leveraging the T2I diffusion prior in the generative sampling process with a predetermined timestep schedule. To this end, we interpret text-to3D optimization as a multi-view image-to-image translation problem, and propose a solution by approximating the probability flow. By leveraging the proposed novel optimization algorithm, we design DreamFlow, a practical three-stage coarseto-fine text-to-3D optimization framework that enables fast generation of highquality and high-resolution (i.e., 1024x1024) 3D contents. For example, we demonstrate that DreamFlow is 5 times faster than the existing state-of-the-art text-to-3D method, while producing more photorealistic 3D contents. Visit our project page (https://kyungmnlee.github.io/dreamflow.github.io/) for visualizations.
comment: ICLR 2024
GPT-Connect: Interaction between Text-Driven Human Motion Generator and 3D Scenes in a Training-free Manner
Recently, while text-driven human motion generation has received massive research attention, most existing text-driven motion generators are generally only designed to generate motion sequences in a blank background. While this is the case, in practice, human beings naturally perform their motions in 3D scenes, rather than in a blank background. Considering this, we here aim to perform scene-aware text-drive motion generation instead. Yet, intuitively training a separate scene-aware motion generator in a supervised way can require a large amount of motion samples to be troublesomely collected and annotated in a large scale of different 3D scenes. To handle this task rather in a relatively convenient manner, in this paper, we propose a novel GPT-connect framework. In GPT-connect, we enable scene-aware motion sequences to be generated directly utilizing the existing blank-background human motion generator, via leveraging ChatGPT to connect the existing motion generator with the 3D scene in a totally training-free manner. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework.
☆ CLIP-VQDiffusion : Langauge Free Training of Text To Image generation using CLIP and vector quantized diffusion model
There has been a significant progress in text conditional image generation models. Recent advancements in this field depend not only on improvements in model structures, but also vast quantities of text-image paired datasets. However, creating these kinds of datasets is very costly and requires a substantial amount of labor. Famous face datasets don't have corresponding text captions, making it difficult to develop text conditional image generation models on these datasets. Some research has focused on developing text to image generation models using only images without text captions. Here, we propose CLIP-VQDiffusion, which leverage the pretrained CLIP model to provide multimodal text-image representations and strong image generation capabilities. On the FFHQ dataset, our model outperformed previous state-of-the-art methods by 4.4% in clipscore and generated very realistic images even when the text was both in and out of distribution. The pretrained models and codes will soon be available at https://github.com/INFINIQ-AI1/CLIPVQDiffusion
comment: 15 pages, 9 figures
☆ STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians
Recent progress in pre-trained diffusion models and 3D generation have spurred interest in 4D content creation. However, achieving high-fidelity 4D generation with spatial-temporal consistency remains a challenge. In this work, we propose STAG4D, a novel framework that combines pre-trained diffusion models with dynamic 3D Gaussian splatting for high-fidelity 4D generation. Drawing inspiration from 3D generation techniques, we utilize a multi-view diffusion model to initialize multi-view images anchoring on the input video frames, where the video can be either real-world captured or generated by a video diffusion model. To ensure the temporal consistency of the multi-view sequence initialization, we introduce a simple yet effective fusion strategy to leverage the first frame as a temporal anchor in the self-attention computation. With the almost consistent multi-view sequences, we then apply the score distillation sampling to optimize the 4D Gaussian point cloud. The 4D Gaussian spatting is specially crafted for the generation task, where an adaptive densification strategy is proposed to mitigate the unstable Gaussian gradient for robust optimization. Notably, the proposed pipeline does not require any pre-training or fine-tuning of diffusion networks, offering a more accessible and practical solution for the 4D generation task. Extensive experiments demonstrate that our method outperforms prior 4D generation works in rendering quality, spatial-temporal consistency, and generation robustness, setting a new state-of-the-art for 4D generation from diverse inputs, including text, image, and video.
Survey on Modeling of Articulated Objects
3D modeling of articulated objects is a research problem within computer vision, graphics, and robotics. Its objective is to understand the shape and motion of the articulated components, represent the geometry and mobility of object parts, and create realistic models that reflect articulated objects in the real world. This survey provides a comprehensive overview of the current state-of-the-art in 3D modeling of articulated objects, with a specific focus on the task of articulated part perception and articulated object creation (reconstruction and generation). We systematically review and discuss the relevant literature from two perspectives: geometry processing and articulation modeling. Through this survey, we highlight the substantial progress made in these areas, outline the ongoing challenges, and identify gaps for future research. Our survey aims to serve as a foundational reference for researchers and practitioners in computer vision and graphics, offering insights into the complexities of articulated object modeling.
☆ Defying Imbalanced Forgetting in Class Incremental Learning AAAI2024
We observe a high level of imbalance in the accuracy of different classes in the same old task for the first time. This intriguing phenomenon, discovered in replay-based Class Incremental Learning (CIL), highlights the imbalanced forgetting of learned classes, as their accuracy is similar before the occurrence of catastrophic forgetting. This discovery remains previously unidentified due to the reliance on average incremental accuracy as the measurement for CIL, which assumes that the accuracy of classes within the same task is similar. However, this assumption is invalid in the face of catastrophic forgetting. Further empirical studies indicate that this imbalanced forgetting is caused by conflicts in representation between semantically similar old and new classes. These conflicts are rooted in the data imbalance present in replay-based CIL methods. Building on these insights, we propose CLass-Aware Disentanglement (CLAD) to predict the old classes that are more likely to be forgotten and enhance their accuracy. Importantly, CLAD can be seamlessly integrated into existing CIL methods. Extensive experiments demonstrate that CLAD consistently improves current replay-based methods, resulting in performance gains of up to 2.56%.
comment: AAAI2024
♻ ☆ Gaussian-SLAM: Photo-realistic Dense SLAM with Gaussian Splatting
We present a dense simultaneous localization and mapping (SLAM) method that uses 3D Gaussians as a scene representation. Our approach enables interactive-time reconstruction and photo-realistic rendering from real-world single-camera RGBD videos. To this end, we propose a novel effective strategy for seeding new Gaussians for newly explored areas and their effective online optimization that is independent of the scene size and thus scalable to larger scenes. This is achieved by organizing the scene into sub-maps which are independently optimized and do not need to be kept in memory. We further accomplish frame-to-model camera tracking by minimizing photometric and geometric losses between the input and rendered frames. The Gaussian representation allows for high-quality photo-realistic real-time rendering of real-world scenes. Evaluation on synthetic and real-world datasets demonstrates competitive or superior performance in mapping, tracking, and rendering compared to existing neural dense SLAM methods.
♻ ☆ Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc. with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics.
comment: Project page at https://videoshop-editing.github.io/
♻ ☆ VideoPoet: A Large Language Model for Zero-Shot Video Generation
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
comment: Project page: http://sites.research.google/videopoet/
♻ ☆ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
♻ ☆ SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery CVPR2024
Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations by the geo-context clue, we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI's multi-modal spatiotemporal features. To our best knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.
comment: Accepted by CVPR2024
♻ ☆ Fast ODE-based Sampling for Diffusion Models in Around 5 Steps CVPR 2024
Sampling from diffusion models can be treated as solving the corresponding ordinary differential equations (ODEs), with the aim of obtaining an accurate solution with as few number of function evaluations (NFE) as possible. Recently, various fast samplers utilizing higher-order ODE solvers have emerged and achieved better performance than the initial first-order one. However, these numerical methods inherently result in certain approximation errors, which significantly degrades sample quality with extremely small NFE (e.g., around 5). In contrast, based on the geometric observation that each sampling trajectory almost lies in a two-dimensional subspace embedded in the ambient space, we propose Approximate MEan-Direction Solver (AMED-Solver) that eliminates truncation errors by directly learning the mean direction for fast diffusion sampling. Besides, our method can be easily used as a plugin to further improve existing ODE-based samplers. Extensive experiments on image synthesis with the resolution ranging from 32 to 512 demonstrate the effectiveness of our method. With only 5 NFE, we achieve 6.61 FID on CIFAR-10, 10.74 FID on ImageNet 64$\times$64, and 13.20 FID on LSUN Bedroom. Our code is available at https://github.com/zju-pi/diff-sampler.
comment: Accepted by CVPR 2024
♻ ☆ Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
In recent years, the application of multimodal large language models (MLLM) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are composed of the well-known Transformer network, which has a less efficient quadratic computation complexity. To improve the efficiency of such basic models, we propose Cobra, a linear computational complexity MLLM. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster speed due to Cobra's linear sequential modeling. (2) Interestingly, the results of closed-set challenging prediction benchmarks show that Cobra performs well in overcoming visual illusions and spatial relationship judgments. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the number of parameters. We will make all codes of Cobra open-source and hope that the proposed method can facilitate future research on complexity problems in MLLM. Our project page is available at: https://sites.google.com/view/cobravlm.
♻ ☆ Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision latency compared to existing naive kernels for 1-D and 2-D neighborhood attention respectively. We find certain inherent inefficiencies in all unfused neighborhood attention kernels that bound their performance and lower-precision scalability. We also developed fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision latency. We observe that our fused kernels successfully circumvent some of the unavoidable inefficiencies in unfused implementations. While our unfused GEMM-based kernels only improve half precision performance compared to naive kernels by an average of 496% and 113% in 1-D and 2-D problems respectively, our fused kernels improve naive kernels by an average of 1607% and 581% in 1-D and 2-D problems respectively.
comment: Project page: https://github.com/SHI-Labs/NATTEN
♻ ☆ Semantics, Distortion, and Style Matter: Towards Source-free UDA for Panoramic Segmentation CVPR 2024
This paper addresses an interesting yet challenging problem -- source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation -- given only a pinhole image-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is nontrivial due to the semantic mismatches, style discrepancies, and inevitable distortion of panoramic images. To this end, we propose a novel method that utilizes Tangent Projection (TP) as it has less distortion and meanwhile slits the equirectangular projection (ERP) with a fixed FoV to mimic the pinhole images. Both projections are shown effective in extracting knowledge from the source model. However, the distinct projection discrepancies between source and target domains impede the direct knowledge transfer; thus, we propose a panoramic prototype adaptation module (PPAM) to integrate panoramic prototypes from the extracted knowledge for adaptation. We then impose the loss constraints on both predictions and prototypes and propose a cross-dual attention module (CDAM) at the feature level to better align the spatial and channel characteristics across the domains and projections. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on the synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our method achieves significantly better performance than prior SFUDA methods for pinhole-to-panoramic adaptation.
comment: Accepted to CVPR 2024
♻ ☆ Win-Win: Training High-Resolution Vision Transformers from Two Windows ICLR 2024
Transformers have become the standard in state-of-the-art vision architectures, achieving impressive performance on both image-level and dense pixelwise tasks. However, training vision transformers for high-resolution pixelwise tasks has a prohibitive cost. Typical solutions boil down to hierarchical architectures, fast and approximate attention, or training on low-resolution crops. This latter solution does not constrain architectural choices, but it leads to a clear performance drop when testing at resolutions significantly higher than that used for training, thus requiring ad-hoc and slow post-processing schemes. In this paper, we propose a novel strategy for efficient training and inference of high-resolution vision transformers. The key principle is to mask out most of the high-resolution inputs during training, keeping only N random windows. This allows the model to learn local interactions between tokens inside each window, and global interactions between tokens from different windows. As a result, the model can directly process the high-resolution input at test time without any special trick. We show that this strategy is effective when using relative positional embedding such as rotary embeddings. It is 4 times faster to train than a full-resolution network, and it is straightforward to use at test time compared to existing approaches. We apply this strategy to three dense prediction tasks with high-resolution data. First, we show on the task of semantic segmentation that a simple setting with 2 windows performs best, hence the name of our method: Win-Win. Second, we confirm this result on the task of monocular depth prediction. Third, we further extend it to the binocular task of optical flow, reaching state-of-the-art performance on the Spring benchmark that contains Full-HD images with an order of magnitude faster inference than the best competitor.
comment: ICLR 2024
♻ ☆ Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images ICLR 2024
Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, the deployment of VLMs necessitates substantial energy consumption and computational resources. Once attackers maliciously induce high energy consumption and latency time (energy-latency cost) during inference of VLMs, it will exhaust computational resources. In this paper, we explore this attack surface about availability of VLMs and aim to induce high energy-latency cost during inference of VLMs. We find that high energy-latency cost during inference of VLMs can be manipulated by maximizing the length of generated sequences. To this end, we propose verbose images, with the goal of crafting an imperceptible perturbation to induce VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is proposed to delay the occurrence of end-of-sequence (EOS) token, where EOS token is a signal for VLMs to stop generating further tokens. Moreover, an uncertainty loss and a token diversity loss are proposed to increase the uncertainty over each generated token and the diversity among all tokens of the whole generated sequence, respectively, which can break output dependency at token-level and sequence-level. Furthermore, a temporal weight adjustment algorithm is proposed, which can effectively balance these losses. Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87 times and 8.56 times compared to original images on MS-COCO and ImageNet datasets, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.
comment: Accepted by ICLR 2024
♻ ☆ Residual Denoising Diffusion Models CVPR2024
We propose residual denoising diffusion models (RDDM), a novel dual diffusion process that decouples the traditional single denoising diffusion process into residual diffusion and noise diffusion. This dual diffusion framework expands the denoising-based diffusion models, initially uninterpretable for image restoration, into a unified and interpretable model for both image generation and restoration by introducing residuals. Specifically, our residual diffusion represents directional diffusion from the target image to the degraded input image and explicitly guides the reverse generation process for image restoration, while noise diffusion represents random perturbations in the diffusion process. The residual prioritizes certainty, while the noise emphasizes diversity, enabling RDDM to effectively unify tasks with varying certainty or diversity requirements, such as image generation and restoration. We demonstrate that our sampling process is consistent with that of DDPM and DDIM through coefficient transformation, and propose a partially path-independent generation process to better understand the reverse process. Notably, our RDDM enables a generic UNet, trained with only an L1 loss and a batch size of 1, to compete with state-of-the-art image restoration methods. We provide code and pre-trained models to encourage further exploration, application, and development of our innovative framework (https://github.com/nachifur/RDDM).
comment: Accepted to CVPR2024
♻ ☆ VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding
The evolution of text to visual components facilitates people's daily lives, such as generating image, videos from text and identifying the desired elements within the images. Computer vision models involving the multimodal abilities in the previous days are focused on image detection, classification based on well-defined objects. Large language models (LLMs) introduces the transformation from nature language to visual objects, which present the visual layout for text contexts. OpenAI GPT-4 has emerged as the pinnacle in LLMs, while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models and algorithms to convert 2D images to their 3D representations. However, the mismatching between the algorithms with the problem could lead to undesired results. In response to this challenge, we propose an unified VisionGPT-3D framework to consolidate the state-of-the-art vision models, thereby facilitating the development of vision-oriented AI. VisionGPT-3D provides a versatile multimodal framework building upon the strengths of multimodal foundation models. It seamlessly integrates various SOTA vision models and brings the automation in the selection of SOTA vision models, identifies the suitable 3D mesh creation algorithms corresponding to 2D depth maps analysis, generates optimal results based on diverse multimodal inputs such as text prompts. Keywords: VisionGPT-3D, 3D vision understanding, Multimodal agent
comment: 12 pages, 7 figures, pending conference
♻ ☆ Toulouse Hyperspectral Data Set: a benchmark data set to assess semi-supervised spectral representation learning and pixel-wise classification techniques
Airborne hyperspectral images can be used to map the land cover in large urban areas, thanks to their very high spatial and spectral resolutions on a wide spectral domain. While the spectral dimension of hyperspectral images is highly informative of the chemical composition of the land surface, the use of state-of-the-art machine learning algorithms to map the land cover has been dramatically limited by the availability of training data. To cope with the scarcity of annotations, semi-supervised and self-supervised techniques have lately raised a lot of interest in the community. Yet, the publicly available hyperspectral data sets commonly used to benchmark machine learning models are not totally suited to evaluate their generalization performances due to one or several of the following properties: a limited geographical coverage (which does not reflect the spectral diversity in metropolitan areas), a small number of land cover classes and a lack of appropriate standard train / test splits for semi-supervised and self-supervised learning. Therefore, we release in this paper the Toulouse Hyperspectral Data Set that stands out from other data sets in the above-mentioned respects in order to meet key issues in spectral representation learning and classification over large-scale hyperspectral images with very few labeled pixels. Besides, we discuss and experiment self-supervised techniques for spectral representation learning, including the Masked Autoencoder, and establish a baseline for pixel-wise classification achieving 85% overall accuracy and 77% F1 score. The Toulouse Hyperspectral Data Set and our code are publicly available at https://www.toulouse-hyperspectral-data-set.com and https://www.github.com/Romain3Ch216/tlse-experiments, respectively.
comment: 17 pages, 13 figures
♻ ☆ ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation IJCNN 2024
Amodal Instance Segmentation (AIS) presents a challenging task as it involves predicting both visible and occluded parts of objects within images. Existing AIS methods rely on a bidirectional approach, encompassing both the transition from amodal features to visible features (amodal-to-visible) and from visible features to amodal features (visible-to-amodal). Our observation shows that the utilization of amodal features through the amodal-to-visible can confuse the visible features due to the extra information of occluded/hidden segments not presented in visible display. Consequently, this compromised quality of visible features during the subsequent visible-to-amodal transition. To tackle this issue, we introduce ShapeFormer, a decoupled Transformer-based model with a visible-to-amodal transition. It facilitates the explicit relationship between output segmentations and avoids the need for amodal-to-visible transitions. ShapeFormer comprises three key modules: (i) Visible-Occluding Mask Head for predicting visible segmentation with occlusion awareness, (ii) Shape-Prior Amodal Mask Head for predicting amodal and occluded masks, and (iii) Category-Specific Shape Prior Retriever aims to provide shape prior knowledge. Comprehensive experiments and extensive ablation studies across various AIS benchmarks demonstrate the effectiveness of our ShapeFormer. The code is available at: https://github.com/UARK-AICV/ShapeFormer
comment: Accepted to IJCNN 2024
♻ ☆ S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes CVPR 2024
Current 3D stylization methods often assume static scenes, which violates the dynamic nature of our real world. To address this limitation, we present S-DyRF, a reference-based spatio-temporal stylization method for dynamic neural radiance fields. However, stylizing dynamic 3D scenes is inherently challenging due to the limited availability of stylized reference images along the temporal axis. Our key insight lies in introducing additional temporal cues besides the provided reference. To this end, we generate temporal pseudo-references from the given stylized reference. These pseudo-references facilitate the propagation of style information from the reference to the entire dynamic 3D scene. For coarse style transfer, we enforce novel views and times to mimic the style details present in pseudo-references at the feature level. To preserve high-frequency details, we create a collection of stylized temporal pseudo-rays from temporal pseudo-references. These pseudo-rays serve as detailed and explicit stylization guidance for achieving fine style transfer. Experiments on both synthetic and real-world datasets demonstrate that our method yields plausible stylized results of space-time view synthesis on dynamic 3D scenes.
comment: Accepted by CVPR 2024. Project page: https://xingyi-li.github.io/s-dyrf/
♻ ☆ Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation
Advanced diffusion-based Text-to-Image (T2I) models, such as the Stable Diffusion Model, have made significant progress in generating diverse and high-quality images using text prompts alone. However, when non-famous users require personalized image generation for their identities (IDs), the T2I models fail to accurately generate their ID-related images. The main problem is that pre-trained T2I models do not learn the mapping between the new ID prompts and their corresponding visual content. The previous methods either failed to accurately fit the face region or lost the interactive generative ability with other existing concepts in T2I models. In other words, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (``Eiffel Tower''), actions (``holding a basketball''), and facial attributes (``eyes closed''). In this paper, we focus on inserting accurate and interactive ID embedding into the Stable Diffusion Model for semantic-fidelity personalized generation. We address this challenge from two perspectives: face-wise region fitting and semantic-fidelity token optimization. Specifically, we first visualize the attention overfit problem and propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background. This key trick significantly enhances the ID accuracy and interactive generative ability with other existing concepts. Then, we optimize one ID representation as multiple per-stage tokens where each token contains two disentangled features. This expansion of the textual conditioning space improves semantic-fidelity control. Extensive experiments validate that our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
comment: 14 pages, 16 figures
♻ ☆ Multi-conditioned Graph Diffusion for Neural Architecture Search
Neural architecture search automates the design of neural network architectures usually by exploring a large and thus complex architecture search space. To advance the architecture search, we present a graph diffusion-based NAS approach that uses discrete conditional graph diffusion processes to generate high-performing neural network architectures. We then propose a multi-conditioned classifier-free guidance approach applied to graph diffusion networks to jointly impose constraints such as high accuracy and low hardware latency. Unlike the related work, our method is completely differentiable and requires only a single model training. In our evaluations, we show promising results on six standard benchmarks, yielding novel and unique architectures at a fast speed, i.e. less than 0.2 seconds per architecture. Furthermore, we demonstrate the generalisability and efficiency of our method through experiments on ImageNet dataset.
comment: Accepted at Transactions on Machine Learning Research (TMLR)
♻ ☆ PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation, empowering non-experts to generate stunning images with unique styles. While promising, adding realistic motions into these personalized images by text poses significant challenges in preserving distinct styles, high-fidelity details, and achieving motion controllability by text. In this paper, we present PIA, a Personalized Image Animator that excels in aligning with condition images, achieving motion controllability by text, and the compatibility with various personalized T2I models without specific tuning. To achieve these goals, PIA builds upon a base T2I model with well-trained temporal alignment layers, allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the introduction of the condition module, which utilizes the condition frame and inter-frame affinity as input to transfer appearance information guided by the affinity hint for individual frame synthesis in the latent space. This design mitigates the challenges of appearance-related image alignment within and allows for a stronger focus on aligning with motion-related guidance.
comment: Project page: https://pi-animator.github.io/
♻ ☆ FunQA: Towards Surprising Video Comprehension
Surprising videos, such as funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) commonsense violations depicted in these videos. We introduce FunQA, a challenging video question-answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks which focus on less surprising contexts, e.g., cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess the model's capability in counter-intuitive timestamp localization, detailed video description, and reasoning around counter-intuitiveness. We also pose higher-level tasks, such as attributing a fitting and vivid title to the video and scoring the video creativity. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning a total of 24 video hours. Moreover, we propose FunMentor, an agent designed for Vision-Language Models (VLMs) that uses multi-turn dialogues to enhance models' understanding of counter-intuitiveness. Extensive experiments with existing VLMs demonstrate the effectiveness of FunMentor and reveal significant performance gaps for the FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.
comment: Project Page: https://funqa-benchmark.github.io/ Codebase: https://github.com/Jingkang50/FunQA
♻ ☆ You Only Need Two Detectors to Achieve Multi-Modal 3D Multi-Object Tracking
In the classical tracking-by-detection (TBD) paradigm, detection and tracking are separately and sequentially conducted, and data association must be properly performed to achieve satisfactory tracking performance. In this paper, a new end-to-end multi-object tracking framework is proposed, which integrates object detection and multi-object tracking into a single model. The proposed tracking framework eliminates the complex data association process in the classical TBD paradigm, and requires no additional training. Secondly, the regression confidence of historical trajectories is investigated, and the possible states of a trajectory (weak object or strong object) in the current frame are predicted. Then, a confidence fusion module is designed to guide non-maximum suppression for trajectories and detections to achieve ordered and robust tracking. Thirdly, by integrating historical trajectory features, the regression performance of the detector is enhanced, which better reflects the occlusion and disappearance patterns of objects in real world. Lastly, extensive experiments are conducted on the commonly used KITTI and Waymo datasets. The results show that the proposed framework can achieve robust tracking by using only a 2D detector and a 3D detector, and it is proven more accurate than many of the state-of-the-art TBD-based multi-modal tracking methods. The source codes of the proposed method are available at https://github.com/wangxiyang2022/YONTD-MOT.
comment: 11 pages, 7 figures
♻ ☆ Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
Sora is the first large-scale generalist video generation model that garnered significant attention across society. Since its launch by OpenAI in February 2024, no other video generation models have paralleled {Sora}'s performance or its capacity to support a broad spectrum of video generation tasks. Additionally, there are only a few fully published video generation models, with the majority being closed-source. To address this gap, this paper proposes a new multi-agent framework Mora, which incorporates several advanced visual AI agents to replicate generalist video generation demonstrated by Sora. In particular, Mora can utilize multiple visual agents and successfully mimic Sora's video generation capabilities in various tasks, such as (1) text-to-video generation, (2) text-conditional image-to-video generation, (3) extend generated videos, (4) video-to-video editing, (5) connect videos and (6) simulate digital worlds. Our extensive experimental results show that Mora achieves performance that is proximate to that of Sora in various tasks. However, there exists an obvious performance gap between our work and Sora when assessed holistically. In summary, we hope this project can guide the future trajectory of video generation through collaborative AI agents.
♻ ☆ MC-NeRF: Multi-Camera Neural Radiance Fields for Multi-Camera Image Acquisition Systems
Neural Radiance Fields (NeRF) use multi-view images for 3D scene representation, demonstrating remarkable performance. As one of the primary sources of multi-view images, multi-camera systems encounter challenges such as varying intrinsic parameters and frequent pose changes. Most previous NeRF-based methods assume a unique camera and rarely consider multi-camera scenarios. Besides, some NeRF methods that can optimize intrinsic and extrinsic parameters still remain susceptible to suboptimal solutions when these parameters are poor initialized. In this paper, we propose MC-NeRF, a method that enables joint optimization of both intrinsic and extrinsic parameters alongside NeRF. The method also supports each image corresponding to independent camera parameters. First, we tackle coupling issue and the degenerate case that arise from the joint optimization between intrinsic and extrinsic parameters. Second, based on the proposed solutions, we introduce an efficient calibration image acquisition scheme for multi-camera systems, including the design of calibration object. Finally, we present an end-to-end network with training sequence that enables the estimation of intrinsic and extrinsic parameters, along with the rendering network. Furthermore, recognizing that most existing datasets are designed for a unique camera, we construct a real multi-camera image acquisition system and create a corresponding new dataset, which includes both simulated data and real-world captured images. Experiments confirm the effectiveness of our method when each image corresponds to different camera parameters. Specifically, we use multi-cameras, each with different intrinsic and extrinsic parameters in real-world system, to achieve 3D scene representation without providing initial poses.
comment: This manuscript is currently under review
♻ ☆ ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting CVPR 2024
The long-tailed distribution problem in medical image analysis reflects a high prevalence of common conditions and a low prevalence of rare ones, which poses a significant challenge in developing a unified model capable of identifying rare or novel tumor categories not encountered during training. In this paper, we propose a new zero-shot pan-tumor segmentation framework (ZePT) based on query-disentangling and self-prompting to segment unseen tumor categories beyond the training set. ZePT disentangles the object queries into two subsets and trains them in two stages. Initially, it learns a set of fundamental queries for organ segmentation through an object-aware feature grouping strategy, which gathers organ-level visual features. Subsequently, it refines the other set of advanced queries that focus on the auto-generated visual prompts for unseen tumor segmentation. Moreover, we introduce query-knowledge alignment at the feature level to enhance each query's discriminative representation and generalizability. Extensive experiments on various tumor segmentation tasks demonstrate the performance superiority of ZePT, which surpasses the previous counterparts and evidence the promising ability for zero-shot tumor segmentation in real-world settings.
comment: This paper has been accepted by CVPR 2024
♻ ☆ FSC: Few-point Shape Completion CVPR 2024
While previous studies have demonstrated successful 3D object shape completion with a sufficient number of points, they often fail in scenarios when a few points, e.g. tens of points, are observed. Surprisingly, via entropy analysis, we find that even a few points, e.g. 64 points, could retain substantial information to help recover the 3D shape of the object. To address the challenge of shape completion with very sparse point clouds, we then propose Few-point Shape Completion (FSC) model, which contains a novel dual-branch feature extractor for handling extremely sparse inputs, coupled with an extensive branch for maximal point utilization with a saliency branch for dynamic importance assignment. This model is further bolstered by a two-stage revision network that refines both the extracted features and the decoder output, enhancing the detail and authenticity of the completed point cloud. Our experiments demonstrate the feasibility of recovering 3D shapes from a few points. The proposed Few-point Shape Completion (FSC) model outperforms previous methods on both few-point inputs and many-point inputs, and shows good generalizability to different object categories.
comment: Accepted by CVPR 2024
♻ ☆ RGBD GS-ICP SLAM
Simultaneous Localization and Mapping (SLAM) with dense representation plays a key role in robotics, Virtual Reality (VR), and Augmented Reality (AR) applications. Recent advancements in dense representation SLAM have highlighted the potential of leveraging neural scene representation and 3D Gaussian representation for high-fidelity spatial representation. In this paper, we propose a novel dense representation SLAM approach with a fusion of Generalized Iterative Closest Point (G-ICP) and 3D Gaussian Splatting (3DGS). In contrast to existing methods, we utilize a single Gaussian map for both tracking and mapping, resulting in mutual benefits. Through the exchange of covariances between tracking and mapping processes with scale alignment techniques, we minimize redundant computations and achieve an efficient system. Additionally, we enhance tracking accuracy and mapping quality through our keyframe selection methods. Experimental results demonstrate the effectiveness of our approach, showing an incredibly fast speed up to 107 FPS (for the entire system) and superior quality of the reconstructed map.
♻ ☆ CPA-Enhancer: Chain-of-Thought Prompted Adaptive Enhancer for Object Detection under Unknown Degradations
Object detection methods under known single degradations have been extensively investigated. However, existing approaches require prior knowledge of the degradation type and train a separate model for each, limiting their practical applications in unpredictable environments. To address this challenge, we propose a chain-of-thought (CoT) prompted adaptive enhancer, CPA-Enhancer, for object detection under unknown degradations. Specifically, CPA-Enhancer progressively adapts its enhancement strategy under the step-by-step guidance of CoT prompts, that encode degradation-related information. To the best of our knowledge, it's the first work that exploits CoT prompting for object detection tasks. Overall, CPA-Enhancer is a plug-and-play enhancement model that can be integrated into any generic detectors to achieve substantial gains on degraded images, without knowing the degradation type priorly. Experimental results demonstrate that CPA-Enhancer not only sets the new state of the art for object detection but also boosts the performance of other downstream vision tasks under unknown degradations.
♻ ☆ S2DM: Sector-Shaped Diffusion Models for Video Generation
Diffusion models have achieved great success in image generation. However, when leveraging this idea for video generation, we face significant challenges in maintaining the consistency and continuity across video frames. This is mainly caused by the lack of an effective framework to align frames of videos with desired temporal features while preserving consistent semantic and stochastic features. In this work, we propose a novel Sector-Shaped Diffusion Model (S2DM) whose sector-shaped diffusion region is formed by a set of ray-shaped reverse diffusion processes starting at the same noise point. S2DM can generate a group of intrinsically related data sharing the same semantic and stochastic features while varying on temporal features with appropriate guided conditions. We apply S2DM to video generation tasks, and explore the use of optical flow as temporal conditions. Our experimental results show that S2DM outperforms many existing methods in the task of video generation without any temporal-feature modelling modules. For text-to-video generation tasks where temporal conditions are not explicitly given, we propose a two-stage generation strategy which can decouple the generation of temporal features from semantic-content features. We show that, without additional training, our model integrated with another temporal conditions generative model can still achieve comparable performance with existing works. Our results can be viewd at https://s2dm.github.io/S2DM/.
comment: 17 pages, 6 figures
♻ ☆ AI-Dentify: Deep learning for proximal caries detection on bitewing x-ray -- HUNT4 Oral Health Study
Background: Dental caries diagnosis requires the manual inspection of diagnostic bitewing images of the patient, followed by a visual inspection and probing of the identified dental pieces with potential lesions. Yet the use of artificial intelligence, and in particular deep-learning, has the potential to aid in the diagnosis by providing a quick and informative analysis of the bitewing images. Methods: A dataset of 13,887 bitewings from the HUNT4 Oral Health Study were annotated individually by six different experts, and used to train three different object detection deep-learning architectures: RetinaNet (ResNet50), YOLOv5 (M size), and EfficientDet (D0 and D1 sizes). A consensus dataset of 197 images, annotated jointly by the same six dentist, was used for evaluation. A five-fold cross validation scheme was used to evaluate the performance of the AI models. Results: he trained models show an increase in average precision and F1-score, and decrease of false negative rate, with respect to the dental clinicians. When compared against the dental clinicians, the YOLOv5 model shows the largest improvement, reporting 0.647 mean average precision, 0.548 mean F1-score, and 0.149 mean false negative rate. Whereas the best annotators on each of these metrics reported 0.299, 0.495, and 0.164 respectively. Conclusion: Deep-learning models have shown the potential to assist dental professionals in the diagnosis of caries. Yet, the task remains challenging due to the artifacts natural to the bitewing images.
comment: 24 pages, 5 figure, 7 tables
♻ ☆ Event-based Simultaneous Localization and Mapping: A Comprehensive Survey
In recent decades, visual simultaneous localization and mapping (vSLAM) has gained significant interest in both academia and industry. It estimates camera motion and reconstructs the environment concurrently using visual sensors on a moving robot. However, conventional cameras are limited by hardware, including motion blur and low dynamic range, which can negatively impact performance in challenging scenarios like high-speed motion and high dynamic range illumination. Recent studies have demonstrated that event cameras, a new type of bio-inspired visual sensor, offer advantages such as high temporal resolution, dynamic range, low power consumption, and low latency. This paper presents a timely and comprehensive review of event-based vSLAM algorithms that exploit the benefits of asynchronous and irregular event streams for localization and mapping tasks. The review covers the working principle of event cameras and various event representations for preprocessing event data. It also categorizes event-based vSLAM methods into four main categories: feature-based, direct, motion-compensation, and deep learning methods, with detailed discussions and practical guidance for each approach. Furthermore, the paper evaluates the state-of-the-art methods on various benchmarks, highlighting current challenges and future opportunities in this emerging research area. A public repository will be maintained to keep track of the rapid developments in this field at {\url{https://github.com/kun150kun/ESLAM-survey}}.
♻ ☆ SyncTweedies: A General Generative Framework Based on Synchronized Diffusions
We introduce a general framework for generating diverse visual content, including ambiguous images, panorama images, mesh textures, and Gaussian splat textures, by synchronizing multiple diffusion processes. We present exhaustive investigation into all possible scenarios for synchronizing multiple diffusion processes through a canonical space and analyze their characteristics across applications. In doing so, we reveal a previously unexplored case: averaging the outputs of Tweedie's formula while conducting denoising in multiple instance spaces. This case also provides the best quality with the widest applicability to downstream tasks. We name this case SyncTweedies. In our experiments generating visual content aforementioned, we demonstrate the superior quality of generation by SyncTweedies compared to other synchronization methods, optimization-based and iterative-update-based methods.
comment: Project page: https://synctweedies.github.io/
♻ ☆ Detection Is Tracking: Point Cloud Multi-Sweep Deep Learning Models Revisited
Conventional tracking paradigm takes in instantaneous measurements such as range and bearing, and produces object tracks across time. In applications such as autonomous driving, lidar measurements in the form of point clouds are usually passed through a "virtual sensor" realized by a deep learning model, to produce "measurements" such as bounding boxes, which are in turn ingested by a tracking module to produce object tracks. Very often multiple lidar sweeps are accumulated in a buffer to merge and become the input to the virtual sensor. We argue in this paper that such an input already contains temporal information, and therefore the virtual sensor output should also contain temporal information, not just instantaneous values for the time corresponding to the end of the buffer. In particular, we present the deep learning model called MULti-Sweep PAired Detector (MULSPAD) that produces, for each detected object, a pair of bounding boxes at both the end time and the beginning time of the input buffer. This is achieved with fairly straightforward changes in commonly used lidar detection models, and with only marginal extra processing, but the resulting symmetry is satisfying. Such paired detections make it possible not only to construct rudimentary trackers fairly easily, but also to construct more sophisticated trackers that can exploit the extra information conveyed by the pair and be robust to choices of motion models and object birth/death models. We have conducted preliminary training and experimentation using Waymo Open Dataset, which shows the efficacy of our proposed method.
comment: My previous employer Motional is requiring a review and approval process before I can publish this paper
♻ ☆ Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning
Instruction tuning of Large Vision-language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks of different sources and formats would lead to inevitable task conflicts, where different tasks conflict for the same set of model parameters, resulting in sub-optimal instructionfollowing abilities. To address that, we propose the Mixture of Clusterconditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate the task-customized model parameters based on the instruction clusters. A separate universal expert is further incorporated to improve generalization capabilities of MoCLE for novel instructions. Extensive experiments on 11 zero-shot tasks demonstrate the effectiveness of MoCLE.
comment: Project website: https://gyhdog99.github.io/projects/mocle/
♻ ☆ ToonAging: Face Re-Aging upon Artistic Portrait Style Transfer
Face re-aging is a prominent field in computer vision and graphics, with significant applications in photorealistic domains such as movies, advertising, and live streaming. Recently, the need to apply face re-aging to non-photorealistic images, like comics, illustrations, and animations, has emerged as an extension in various entertainment sectors. However, the lack of a network that can seamlessly edit the apparent age in NPR images has limited these tasks to a naive, sequential approach. This often results in unpleasant artifacts and a loss of facial attributes due to domain discrepancies. In this paper, we introduce a novel one-stage method for face re-aging combined with portrait style transfer, executed in a single generative step. We leverage existing face re-aging and style transfer networks, both trained within the same PR domain. Our method uniquely fuses distinct latent vectors, each responsible for managing aging-related attributes and NPR appearance. By adopting an exemplar-based approach, our method offers greater flexibility compared to domain-level fine-tuning approaches, which typically require separate training or fine-tuning for each domain. This effectively addresses the limitation of requiring paired datasets for re-aging and domain-level, data-driven approaches for stylization. Our experiments show that our model can effortlessly generate re-aged images while simultaneously transferring the style of examples, maintaining both natural appearance and controllability.
comment: 14 pages, 15 figures, 1 table
♻ ☆ Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
Multimodal large language models (MLLMs) have shown impressive reasoning abilities, which, however, are also more vulnerable to jailbreak attacks than their LLM predecessors. Although still capable of detecting unsafe responses, we observe that safety mechanisms of the pre-aligned LLMs in MLLMs can be easily bypassed due to the introduction of image features. To construct robust MLLMs, we propose ECSO(Eyes Closed, Safety On), a novel training-free protecting approach that exploits the inherent safety awareness of MLLMs, and generates safer responses via adaptively transforming unsafe images into texts to activate intrinsic safety mechanism of pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that our ECSO enhances model safety significantly (e.g., a 37.6% improvement on the MM-SafetyBench (SD+OCR), and 71.3% on VLSafe for the LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised-finetuning (SFT) data for MLLM alignment without extra human intervention.
comment: Project Page: https://gyhdog99.github.io/projects/ecso/
♻ ☆ MV-ROPE: Multi-view Constraints for Robust Category-level Object Pose and Size Estimation
Recently there has been a growing interest in category-level object pose and size estimation, and prevailing methods commonly rely on single view RGB-D images. However, one disadvantage of such methods is that they require accurate depth maps which cannot be produced by consumer-grade sensors. Furthermore, many practical real-world situations involve a moving camera that continuously observes its surroundings, and the temporal information of the input video streams is simply overlooked by single-view methods. We propose a novel solution that makes use of RGB video streams. Our framework consists of three modules: a scale-aware monocular dense SLAM solution, a lightweight object pose predictor, and an object-level pose graph optimizer. The SLAM module utilizes a video stream and additional scale-sensitive readings to estimate camera poses and metric depth. The object pose predictor then generates canonical object representations from RGB images. The object pose is estimated through geometric registration of these canonical object representations with estimated object depth points. All per-view estimates finally undergo optimization within a pose graph, culminating in the output of robust and accurate canonical object poses. Our experimental results demonstrate that when utilizing public dataset sequences with high-quality depth information, the proposed method exhibits comparable performance to state-of-the-art RGB-D methods. We also collect and evaluate on new datasets containing depth maps of varying quality to further quantitatively benchmark the proposed method alongside previous RGB-D based methods. We demonstrate a significant advantage in scenarios where depth input is absent or the quality of depth sensing is limited.
♻ ☆ D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction
Reconstructing hand-held objects from a single RGB image is a challenging task in computer vision. In contrast to prior works that utilize deterministic modeling paradigms, we employ a point cloud denoising diffusion model to account for the probabilistic nature of this problem. In the core, we introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction (D-SCo), tackling two predominant challenges. First, to avoid the object centroid from deviating, we utilize a novel hand-constrained centroid fixing paradigm, enhancing the stability of diffusion and reverse processes and the precision of feature projection. Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions with a novel unified hand-object semantic embedding, enhancing the reconstruction performance of the hand-occluded region of the object. Experiments on the synthetic ObMan dataset and three real-world datasets HO3D, MOW and DexYCB demonstrate that our approach can surpass all other state-of-the-art methods. Codes will be released.
♻ ☆ Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
Egocentric gaze anticipation serves as a key building block for the emerging capability of Augmented Reality. Notably, gaze behavior is driven by both visual cues and audio signals during daily activities. Motivated by this observation, we introduce the first model that leverages both the video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Temporal Separable (CSTS) fusion approach that adopts two modules to separately capture audio-visual correlations in spatial and temporal dimensions, and applies a contrastive loss on the re-weighted audio-visual features from fusion modules for representation learning. We conduct extensive ablation studies and thorough analysis using two egocentric video datasets: Ego4D and Aria, to validate our model design. We demonstrate the audio improves the performance by +2.5% and +2.4% on the two datasets. Our model also outperforms the prior state-of-the-art methods by at least +1.9% and +1.6%. Moreover, we provide visualizations to show the gaze anticipation results and provide additional insights into audio-visual representation learning. The code and data split are available on our website (https://bolinlai.github.io/CSTS-EgoGazeAnticipation/).
comment: 30 pages
♻ ☆ 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation CVPR 2024
Estimating the 6D object pose from a single RGB image often involves noise and indeterminacy due to challenges such as occlusions and cluttered backgrounds. Meanwhile, diffusion models have shown appealing performance in generating high-quality images from random noise with high indeterminacy through step-by-step denoising. Inspired by their denoising capability, we propose a novel diffusion-based framework (6D-Diff) to handle the noise and indeterminacy in object pose estimation for better performance. In our framework, to establish accurate 2D-3D correspondence, we formulate 2D keypoints detection as a reverse diffusion (denoising) process. To facilitate such a denoising process, we design a Mixture-of-Cauchy-based forward diffusion process and condition the reverse process on the object features. Extensive experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our framework.
comment: CVPR 2024 CAMERA-READY
♻ ☆ Promoting Segment Anything Model towards Highly Accurate Dichotomous Image Segmentation
The Segment Anything Model (SAM) represents a significant breakthrough into foundation models for computer vision, providing a large-scale image segmentation model. However, despite SAM's zero-shot performance, its segmentation masks lack fine-grained details, particularly in accurately delineating object boundaries. We have high expectations regarding whether SAM, as a foundation model, can be improved towards highly accurate object segmentation, which is known as dichotomous image segmentation (DIS). To address this issue, we propose DIS-SAM, which advances SAM towards DIS with extremely accurate details. DIS-SAM is a framework specifically tailored for highly accurate segmentation, maintaining SAM's promptable design. DIS-SAM employs a two-stage approach, integrating SAM with a modified IS-Net dedicated to DIS. Despite its simplicity, DIS-SAM demonstrates significantly enhanced segmentation accuracy compared to SAM and HQ-SAM.
♻ ☆ Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation
As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to navigate to a specified object depicted by a goal image in an unexplored environment. The main challenge of this task lies in identifying the target object from different viewpoints while rejecting similar distractors. Existing ImageGoal Navigation methods usually adopt the simple Exploration-Exploitation framework and ignore the identification of specific instance during navigation. In this work, we propose to imitate the human behaviour of ``getting closer to confirm" when distinguishing objects from a distance. Specifically, we design a new modular navigation framework named Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level image goal navigation. Our method allows for active switching among the exploration, verification, and exploitation actions, thereby facilitating the agent in making reasonable decisions under different situations. On the challenging HabitatMatterport 3D semantic (HM3D-SEM) dataset, our method surpasses previous state-of-the-art work, with a classical segmentation model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 0.561 success)
♻ ☆ mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
comment: Working in Process
♻ ☆ ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection
Out-of-distribution (OOD) detection methods often exploit auxiliary outliers to train model identifying OOD samples, especially discovering challenging outliers from auxiliary outliers dataset to improve OOD detection. However, they may still face limitations in effectively distinguishing between the most challenging OOD samples that are much like in-distribution (ID) data, i.e., \idlike samples. To this end, we propose a novel OOD detection framework that discovers \idlike outliers using CLIP \cite{DBLP:conf/icml/RadfordKHRGASAM21} from the vicinity space of the ID samples, thus helping to identify these most challenging OOD samples. Then a prompt learning framework is proposed that utilizes the identified \idlike outliers to further leverage the capabilities of CLIP for OOD detection. Benefiting from the powerful CLIP, we only need a small number of ID samples to learn the prompts of the model without exposing other auxiliary outlier datasets. By focusing on the most challenging \idlike OOD samples and elegantly exploiting the capabilities of CLIP, our method achieves superior few-shot learning performance on various real-world image datasets (e.g., in 4-shot OOD detection on the ImageNet-1k dataset, our method reduces the average FPR95 by 12.16\% and improves the average AUROC by 2.76\%, compared to state-of-the-art methods). Code is available at https://github.com/ycfate/ID-like.
♻ ☆ BigGait: Learning Gait Representation You Want by Large Vision Models
Gait recognition stands as one of the most pivotal remote identification technologies and progressively expands across research and industry communities. However, existing gait recognition methods heavily rely on task-specific upstream driven by supervised learning to provide explicit gait representations like silhouette sequences, which inevitably introduce expensive annotation costs and potential error accumulation. Escaping from this trend, this work explores effective gait representations based on the all-purpose knowledge produced by task-agnostic Large Vision Models (LVMs) and proposes a simple yet efficient gait framework, termed BigGait. Specifically, the Gait Representation Extractor (GRE) within BigGait draws upon design principles from established gait representations, effectively transforming all-purpose knowledge into implicit gait representations without requiring third-party supervision signals. Experiments on CCPG, CAISA-B* and SUSTech1K indicate that BigGait significantly outperforms the previous methods in both within-domain and cross-domain tasks in most cases, and provides a more practical paradigm for learning the next-generation gait representation. Finally, we delve into prospective challenges and promising directions in LVMs-based gait recognition, aiming to inspire future work in this emerging topic. The source code is available at https://github.com/ShiqiYu/OpenGait.
♻ ☆ Image super-resolution via dynamic network
Convolutional neural networks (CNNs) depend on deep network architectures to extract accurate information for image super-resolution. However, obtained information of these CNNs cannot completely express predicted high-quality images for complex scenes. In this paper, we present a dynamic network for image super-resolution (DSRNet), which contains a residual enhancement block, wide enhancement block, feature refinement block and construction block. The residual enhancement block is composed of a residual enhanced architecture to facilitate hierarchical features for image super-resolution. To enhance robustness of obtained super-resolution model for complex scenes, a wide enhancement block achieves a dynamic architecture to learn more robust information to enhance applicability of an obtained super-resolution model for varying scenes. To prevent interference of components in a wide enhancement block, a refinement block utilizes a stacked architecture to accurately learn obtained features. Also, a residual learning operation is embedded in the refinement block to prevent long-term dependency problem. Finally, a construction block is responsible for reconstructing high-quality images. Designed heterogeneous architecture can not only facilitate richer structural information, but also be lightweight, which is suitable for mobile digital devices. Experimental results shows that our method is more competitive in terms of performance and recovering time of image super-resolution and complexity. The code of DSRNet can be obtained at https://github.com/hellloxiaotian/DSRNet.
♻ ☆ Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future
With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem. Current autonomous driving datasets can broadly be categorized into two generations. The first-generation autonomous driving datasets are characterized by relatively simpler sensor modalities, smaller data scale, and is limited to perception-level tasks. KITTI, introduced in 2012, serves as a prominent representative of this initial wave. In contrast, the second-generation datasets exhibit heightened complexity in sensor modalities, greater data scale and diversity, and an expansion of tasks from perception to encompass prediction and control. Leading examples of the second generation include nuScenes and Waymo, introduced around 2019. This comprehensive review, conducted in collaboration with esteemed colleagues from both academia and industry, systematically assesses over seventy open-source autonomous driving datasets from domestic and international sources. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets, the pivotal role of data engine systems, and the utilization of generative foundation models to facilitate scalable data generation. Furthermore, this review undertakes an exhaustive analysis and discourse regarding the characteristics and data scales that future third-generation autonomous driving datasets should possess. It also delves into the scientific and technical challenges that warrant resolution. These endeavors are pivotal in advancing autonomous innovation and fostering technological enhancement in critical domains. For further details, please refer to https://github.com/OpenDriveLab/DriveAGI.
comment: This article is a simplified English translation of corresponding Chinese article. Please refer to Chinese version for the complete content
♻ ☆ LSKNet: A Foundation Lightweight Backbone for Remote Sensing
Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at https://github.com/zcablii/LSKNet.
comment: arXiv admin note: substantial text overlap with arXiv:2303.09030
♻ ☆ Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization ICLR 2024
Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevailing approaches primarily regard the visual input as a prompt and focus exclusively on optimizing the text generation process conditioned upon vision content by a frozen LLM. Such an inequitable treatment of vision and language heavily constrains the model's potential. In this paper, we break through this limitation by representing both vision and language in a unified form. Specifically, we introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language that LLM can read. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image. Coped with this tokenizer, the presented foundation model called LaVIT can handle both image and text indiscriminately under the same generative learning paradigm. This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously. Extensive experiments further showcase that it outperforms the existing models by a large margin on massive vision-language tasks. Our code and models are available at https://github.com/jy0205/LaVIT.
comment: ICLR 2024
♻ ☆ LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem -- egocentric action frame generation. The goal is to synthesize an image depicting an action in the user's context (i.e., action frame) by conditioning on a user prompt and an input egocentric image. Notably, existing egocentric action datasets lack the detailed annotations that describe the execution of actions. Additionally, existing diffusion-based image manipulation models are sub-optimal in controlling the state transition of an action in egocentric image pixel space because of the domain gap. To this end, we propose to Learn EGOcentric (LEGO) action frame generation via visual instruction tuning. First, we introduce a prompt enhancement scheme to generate enriched action descriptions from a visual large language model (VLLM) by visual instruction tuning. Then we propose a novel method to leverage image and text embeddings from the VLLM as additional conditioning to improve the performance of a diffusion model. We validate our model on two egocentric datasets -- Ego4D and Epic-Kitchens. Our experiments show substantial improvement over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights in our method. More details of the dataset and code are available on the website (https://bolinlai.github.io/Lego_EgoActGen/).
comment: 34 pages
♻ ☆ Predicting Generalization of AI Colonoscopy Models to Unseen Data
$\textbf{Background}$: Generalizability of AI colonoscopy algorithms is important for wider adoption in clinical practice. However, current techniques for evaluating performance on unseen data require expensive and time-intensive labels. $\textbf{Methods}$: We use a "Masked Siamese Network" (MSN) to identify novel phenomena in unseen data and predict polyp detector performance. MSN is trained to predict masked out regions of polyp images, without any labels. We test MSN's ability to be trained on data only from Israel and detect unseen techniques, narrow-band imaging (NBI) and chromendoscoy (CE), on colonoscopes from Japan (354 videos, 128 hours). We also test MSN's ability to predict performance of Computer Aided Detection (CADe) of polyps on colonoscopies from both countries, even though MSN is not trained on data from Japan. $\textbf{Results}$: MSN correctly identifies NBI and CE as less similar to Israel whitelight than Japan whitelight (bootstrapped z-test, |z| > 496, p < 10^-8 for both) using the label-free Frechet distance. MSN detects NBI with 99% accuracy, predicts CE better than our heuristic (90% vs 79% accuracy) despite being trained only on whitelight, and is the only method that is robust to noisy labels. MSN predicts CADe polyp detector performance on in-domain Israel and out-of-domain Japan colonoscopies (r=0.79, 0.37 respectively). With few examples of Japan detector performance to train on, MSN prediction of Japan performance improves (r=0.56). $\textbf{Conclusion}$: Our technique can identify distribution shifts in clinical data and can predict CADe detector performance on unseen data, without labels. Our self-supervised approach can aid in detecting when data in practice is different from training, such as between hospitals or data has meaningfully shifted from training. MSN has potential for application to medical image domains beyond colonoscopy.
♻ ☆ Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation CVPR 2024
Content-aware graphic layout generation aims to automatically arrange visual elements along with a given content, such as an e-commerce product image. In this paper, we argue that the current layout generation approaches suffer from the limited training data for the high-dimensional layout structure. We show that a simple retrieval augmentation can significantly improve the generation quality. Our model, which is named Retrieval-Augmented Layout Transformer (RALF), retrieves nearest neighbor layout examples based on an input image and feeds these results into an autoregressive generator. Our model can apply retrieval augmentation to various controllable generation tasks and yield high-quality layouts within a unified architecture. Our extensive experiments show that RALF successfully generates content-aware layouts in both constrained and unconstrained settings and significantly outperforms the baselines.
comment: Accepted to CVPR 2024, Project website: https://udonda.github.io/RALF/
♻ ☆ BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image CVPR 2024
Creating personalized hand avatars is important to offer a realistic experience to users on AR / VR platforms. While most prior studies focused on reconstructing 3D hand shapes, some recent work has tackled the reconstruction of hand textures on top of shapes. However, these methods are often limited to capturing pixels on the visible side of a hand, requiring diverse views of the hand in a video or multiple images as input. In this paper, we propose a novel method, BiTT(Bi-directional Texture reconstruction of Two hands), which is the first end-to-end trainable method for relightable, pose-free texture reconstruction of two interacting hands taking only a single RGB image, by three novel components: 1) bi-directional (left $\leftrightarrow$ right) texture reconstruction using the texture symmetry of left / right hands, 2) utilizing a texture parametric model for hand texture recovery, and 3) the overall coarse-to-fine stage pipeline for reconstructing personalized texture of two interacting hands. BiTT first estimates the scene light condition and albedo image from an input image, then reconstructs the texture of both hands through the texture parametric model and bi-directional texture reconstructor. In experiments using InterHand2.6M and RGB2Hands datasets, our method significantly outperforms state-of-the-art hand texture reconstruction methods quantitatively and qualitatively. The code is available at https://github.com/yunminjin2/BiTT
comment: Accepted by CVPR 2024
♻ ☆ On Image Search in Histopathology
Pathology images of histopathology can be acquired from camera-mounted microscopes or whole slide scanners. Utilizing similarity calculations to match patients based on these images holds significant potential in research and clinical contexts. Recent advancements in search technologies allow for implicit quantification of tissue morphology across diverse primary sites, facilitating comparisons and enabling inferences about diagnosis, and potentially prognosis, and predictions for new patients when compared against a curated database of diagnosed and treated cases. In this paper, we comprehensively review the latest developments in image search technologies for histopathology, offering a concise overview tailored for computational pathology researchers seeking effective, fast and efficient image search methods in their work.
comment: A chapter in the Book "Artificial INtelligence in Digital Pathology" by Cohen and Chauhan, 2024
♻ ☆ Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention CVPR 2024
Recently, Vision Transformer has achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task. Despite its superiority in VSR accuracy, the heavy computational burden as well as the large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices. In this paper, we address the above issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and inter frame Attention (MIA-VSR). The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block which takes the respective roles of past features and input features into consideration and only exploits previously enhanced features to provide supplementary information. In addition, an adaptive block-wise mask prediction module is developed to skip unimportant computations according to feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves the memory and computation efficiency over state-of-the-art methods, without trading off PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR.
comment: Accepted by CVPR 2024
♻ ☆ SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation CVPR 2024
Despite their ability to generate high-resolution and diverse images from text prompts, text-to-image diffusion models often suffer from slow iterative sampling processes. Model distillation is one of the most effective directions to accelerate these models. However, previous distillation methods fail to retain the generation quality while requiring a significant amount of images for training, either from real data or synthetically generated by the teacher model. In response to this limitation, we present a novel image-free distillation scheme named $\textbf{SwiftBrush}$. Drawing inspiration from text-to-3D synthesis, in which a 3D neural radiance field that aligns with the input prompt can be obtained from a 2D text-to-image diffusion prior via a specialized loss without the use of any 3D data ground-truth, our approach re-purposes that same loss for distilling a pretrained multi-step text-to-image model to a student network that can generate high-fidelity images with just a single inference step. In spite of its simplicity, our model stands as one of the first one-step text-to-image generators that can produce images of comparable quality to Stable Diffusion without reliance on any training image data. Remarkably, SwiftBrush achieves an FID score of $\textbf{16.67}$ and a CLIP score of $\textbf{0.29}$ on the COCO-30K benchmark, achieving competitive results or even substantially surpassing existing state-of-the-art distillation techniques.
comment: Accepted to CVPR 2024; Project Page: https://thuanz123.github.io/swiftbrush/
♻ ☆ Physics-Enhanced Multi-fidelity Learning for Optical Surface Imprint
Human fingerprints serve as one unique and powerful characteristic for each person, from which policemen can recognize the identity. Similar to humans, many natural bodies and intrinsic mechanical qualities can also be uniquely identified from surface characteristics. To measure the elasto-plastic properties of one material, one formally sharp indenter is pushed into the measured body under constant force and retracted, leaving a unique residual imprint of the minute size from several micrometers to nanometers. However, one great challenge is how to map the optical image of this residual imprint into the real wanted mechanical properties, \ie, the tensile force curve. In this paper, we propose a novel method to use multi-fidelity neural networks (MFNN) to solve this inverse problem. We first build up the NN model via pure simulation data, and then bridge the sim-to-real gap via transfer learning. Considering the difficulty of collecting real experimental data, we use NN to dig out the unknown physics and also implant the known physics into the transfer learning framework, thus highly improving the model stability and decreasing the data requirement. The final constructed model only needs three-shot calibration of real materials. We tested the final model across 20 real materials and achieved satisfying accuracy. This work serves as one great example of applying machine learning into scientific research, especially under the constraints of data limitation and fidelity variance.
comment: 15 pages, 11 figure
♻ ☆ Rethinking Boundary Discontinuity Problem for Oriented Object Detection
Oriented object detection has been developed rapidly in the past few years, where rotation equivariance is crucial for detectors to predict rotated boxes. It is expected that the prediction can maintain the corresponding rotation when objects rotate, but severe mutation in angular prediction is sometimes observed when objects rotate near the boundary angle, which is well-known boundary discontinuity problem. The problem has been long believed to be caused by the sharp loss increase at the angular boundary, and widely used joint-optim IoU-like methods deal with this problem by loss-smoothing. However, we experimentally find that even state-of-the-art IoU-like methods actually fail to solve the problem. On further analysis, we find that the key to solution lies in encoding mode of the smoothing function rather than in joint or independent optimization. In existing IoU-like methods, the model essentially attempts to fit the angular relationship between box and object, where the break point at angular boundary makes the predictions highly unstable.To deal with this issue, we propose a dual-optimization paradigm for angles. We decouple reversibility and joint-optim from single smoothing function into two distinct entities, which for the first time achieves the objectives of both correcting angular boundary and blending angle with other parameters.Extensive experiments on multiple datasets show that boundary discontinuity problem is well-addressed. Moreover, typical IoU-like methods are improved to the same level without obvious performance gap. The code is available at https://github.com/hangxu-cv/cvpr24acm.
comment: cvpr 2024
♻ ☆ BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP
Contrastive Vision-Language Pre-training, known as CLIP, has shown promising effectiveness in addressing downstream image recognition tasks. However, recent works revealed that the CLIP model can be implanted with a downstream-oriented backdoor. On downstream tasks, one victim model performs well on clean samples but predicts a specific target class whenever a specific trigger is present. For injecting a backdoor, existing attacks depend on a large amount of additional data to maliciously fine-tune the entire pre-trained CLIP model, which makes them inapplicable to data-limited scenarios. In this work, motivated by the recent success of learnable prompts, we address this problem by injecting a backdoor into the CLIP model in the prompt learning stage. Our method named BadCLIP is built on a novel and effective mechanism in backdoor attacks on CLIP, i.e., influencing both the image and text encoders with the trigger. It consists of a learnable trigger applied to images and a trigger-aware context generator, such that the trigger can change text features via trigger-aware prompts, resulting in a powerful and generalizable attack. Extensive experiments conducted on 11 datasets verify that the clean accuracy of BadCLIP is similar to those of advanced prompt learning methods and the attack success rate is higher than 99% in most cases. BadCLIP is also generalizable to unseen classes, and shows a strong generalization capability under cross-dataset and cross-domain settings.
comment: 14 pages, 6 figures
♻ ☆ Hulk: A Universal Knowledge Translator for Human-Centric Tasks
Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance in 11 benchmarks. The code is available on https://github.com/OpenGVLab/Hulk.
comment: 24 pages, 5 figures
♻ ☆ CBNet: A Plug-and-Play Network for Segmentation-Based Scene Text Detection
Recently, segmentation-based methods are quite popular in scene text detection, which mainly contain two steps: text kernel segmentation and expansion. However, the segmentation process only considers each pixel independently, and the expansion process is difficult to achieve a favorable accuracy-speed trade-off. In this paper, we propose a Context-aware and Boundary-guided Network (CBN) to tackle these problems. In CBN, a basic text detector is firstly used to predict initial segmentation results. Then, we propose a context-aware module to enhance text kernel feature representations, which considers both global and local contexts. Finally, we introduce a boundary-guided module to expand enhanced text kernels adaptively with only the pixels on the contours, which not only obtains accurate text boundaries but also keeps high speed, especially on high-resolution output maps. In particular, with a lightweight backbone, the basic detector equipped with our proposed CBN achieves state-of-the-art results on several popular benchmarks, and our proposed CBN can be plugged into several segmentation-based methods. Code is available at https://github.com/XiiZhao/cbn.pytorch.
comment: Accepted by IJCV 2024. Code is available at https://github.com/XiiZhao/cbn.pytorch
♻ ☆ Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective training paradigm for training large multimodal models in non-English languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data can well generalize to other languages in a (quasi)-zero-shot manner, even surpassing models trained on image-text data in native languages. Taking Chinese as a practice of MPM, we build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source codes and model weights at https://github.com/OpenBMB/VisCPM.git.
comment: https://github.com/OpenBMB/VisCPM.git
♻ ☆ Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation
We introduce a method to generate temporally coherent human animation from a single image, a video, or a random noise. This problem has been formulated as modeling of an auto-regressive generation, i.e., to regress past frames to decode future frames. However, such unidirectional generation is highly prone to motion drifting over time, generating unrealistic human animation with significant artifacts such as appearance distortion. We claim that bidirectional temporal modeling enforces temporal coherence on a generative network by largely suppressing the motion ambiguity of human appearance. To prove our claim, we design a novel human animation framework using a denoising diffusion model: a neural network learns to generate the image of a person by denoising temporal Gaussian noises whose intermediate results are cross-conditioned bidirectionally between consecutive frames. In the experiments, our method demonstrates strong performance compared to existing unidirectional approaches with realistic temporal coherence.
comment: Project page: see https://typest.github.io/btdm
♻ ☆ AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks
Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g. InstructPix2Pix, InstantID, etc) to modify the first frame, (2) utilizing an existing image-to-video generation model (e.g. I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tools to support an extensive array of video editing tasks. Beyond the traditional prompt-based editing methods, AnyV2V also can support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video models to perform DDIM inversion and intermediate feature injection to maintain the appearance and motion consistency with the source video. On the prompt-based editing, we show that AnyV2V can outperform the previous best approach by 35\% on prompt alignment, and 25\% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate the fast-evolving image editing methods. Such compatibility can help AnyV2V to increase its versatility to cater to diverse user demands.
comment: preprint
♻ ☆ iSLAM: Imperative SLAM
Simultaneous Localization and Mapping (SLAM) stands as one of the critical challenges in robot navigation. A SLAM system often consists of a front-end component for motion estimation and a back-end system for eliminating estimation drifts. Recent advancements suggest that data-driven methods are highly effective for front-end tasks, while geometry-based methods continue to be essential in the back-end processes. However, such a decoupled paradigm between the data-driven front-end and geometry-based back-end can lead to sub-optimal performance, consequently reducing the system's capabilities and generalization potential. To solve this problem, we proposed a novel self-supervised imperative learning framework, named imperative SLAM (iSLAM), which fosters reciprocal correction between the front-end and back-end, thus enhancing performance without necessitating any external supervision. Specifically, we formulate the SLAM problem as a bilevel optimization so that the front-end and back-end are bidirectionally connected. As a result, the front-end model can learn global geometric knowledge obtained through pose graph optimization by back-propagating the residuals from the back-end component. We showcase the effectiveness of this new framework through an application of stereo-inertial SLAM. The experiments show that the iSLAM training strategy achieves an accuracy improvement of 22% on average over a baseline model. To the best of our knowledge, iSLAM is the first SLAM system showing that the front-end and back-end components can mutually correct each other in a self-supervised manner.
comment: The paper has been accepted by IEEE Robotics and Automation Letters (RA-L)
♻ ☆ NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation CVPR 2024
Data-Free Knowledge Distillation (DFKD) has made significant recent strides by transferring knowledge from a teacher neural network to a student neural network without accessing the original data. Nonetheless, existing approaches encounter a significant challenge when attempting to generate samples from random noise inputs, which inherently lack meaningful information. Consequently, these models struggle to effectively map this noise to the ground-truth sample distribution, resulting in prolonging training times and low-quality outputs. In this paper, we propose a novel Noisy Layer Generation method (NAYER) which relocates the random source from the input to a noisy layer and utilizes the meaningful constant label-text embedding (LTE) as the input. LTE is generated by using the language model once, and then it is stored in memory for all subsequent training processes. The significance of LTE lies in its ability to contain substantial meaningful inter-class information, enabling the generation of high-quality samples with only a few training steps. Simultaneously, the noisy layer plays a key role in addressing the issue of diversity in sample generation by preventing the model from overemphasizing the constrained label information. By reinitializing the noisy layer in each iteration, we aim to facilitate the generation of diverse samples while still retaining the method's efficiency, thanks to the ease of learning provided by LTE. Experiments carried out on multiple datasets demonstrate that our NAYER not only outperforms the state-of-the-art methods but also achieves speeds 5 to 15 times faster than previous approaches. The code is available at https://github.com/tmtuan1307/nayer.
comment: Accepted at CVPR 2024
♻ ☆ UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification
Vision-Language Pre-training (VLP) that utilizes the multi-modal information to promote the training efficiency and effectiveness, has achieved great success in vision recognition of natural domains and shown promise in medical imaging diagnosis for the Chest X-Rays (CXRs). However, current works mainly pay attention to the exploration on single dataset of CXRs, which locks the potential of this powerful paradigm on larger hybrid of multi-source CXRs datasets. We identify that although blending samples from the diverse sources offers the advantages to improve the model generalization, it is still challenging to maintain the consistent superiority for the task of each source due to the existing heterogeneity among sources. To handle this dilemma, we design a Conquer-and-Divide pre-training framework, termed as UniChest, aiming to make full use of the collaboration benefit of multiple sources of CXRs while reducing the negative influence of the source heterogeneity. Specially, the ``Conquer" stage in UniChest encourages the model to sufficiently capture multi-source common patterns, and the ``Divide" stage helps squeeze personalized patterns into different small experts (query networks). We conduct thorough experiments on many benchmarks, e.g., ChestX-ray14, CheXpert, Vindr-CXR, Shenzhen, Open-I and SIIM-ACR Pneumothorax, verifying the effectiveness of UniChest over a range of baselines, and release our codes and pre-training models at https://github.com/Elfenreigen/UniChest.
comment: Accepted at IEEE Transactions on Medical Imaging
♻ ☆ Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding
Learning from videos is an emerging research area that enables robots to acquire skills from human demonstrations, such as procedural videos. To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) intra-video retrieval over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to make use of: (1) out-of-domain visual information; (2) a high temporal context window; and (3) multimodal (e.g. visual and speech) domains. This departs from existing benchmarks for procedural video understanding, which typically deal with short context lengths and can be solved with a single modality. Spacewalk-18, with its inherent multimodal and long-form complexity, exposes the high difficulty of task recognition and segmentation. We find that state-of-the-art methods perform poorly on our benchmark, but improvements can be obtained by incorporating information from longer-range temporal context across different modalities. Our experiments underscore the need to develop new approaches to these tasks. Data, model, and code will be released at https://brown-palm.github.io/Spacewalk-18/.
comment: Under submission. Code and models will be released at https://brown-palm.github.io/Spacewalk-18/
Information Retrieval
☆ Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions ACL 2024
This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.
comment: 10 pages, 4 figures, submitted to ACL 2024, for a screencast see https://www.youtube.com/watch?v=9GJExMelhdI
☆ FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
Modern Large Language Models (LLMs) are capable of following long and complex instructions that enable a diverse amount of user tasks. However, despite Information Retrieval (IR) models using LLMs as the backbone of their architectures, nearly all of them still only take queries as input, with no instructions. For the handful of recent models that do take instructions, it's unclear how they use them. We introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR builds off the long history of the TREC conferences: as TREC provides human annotators with instructions (also known as narratives) to determine document relevance, so should IR models be able to understand and decide relevance based on these detailed instructions. Our evaluation benchmark starts with three deeply judged TREC collections and alters the annotator instructions, re-annotating relevant documents. Through this process, we can measure how well IR models follow instructions, through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them for basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model has significant improvements (over 13%) after fine-tuning on our training set.
☆ Bilateral Unsymmetrical Graph Contrastive Learning for Recommendation
Recent methods utilize graph contrastive Learning within graph-structured user-item interaction data for collaborative filtering and have demonstrated their efficacy in recommendation tasks. However, they ignore that the difference relation density of nodes between the user- and item-side causes the adaptability of graphs on bilateral nodes to be different after multi-hop graph interaction calculation, which limits existing models to achieve ideal results. To solve this issue, we propose a novel framework for recommendation tasks called Bilateral Unsymmetrical Graph Contrastive Learning (BusGCL) that consider the bilateral unsymmetry on user-item node relation density for sliced user and item graph reasoning better with bilateral slicing contrastive training. Especially, taking into account the aggregation ability of hypergraph-based graph convolutional network (GCN) in digging implicit similarities is more suitable for user nodes, embeddings generated from three different modules: hypergraph-based GCN, GCN and perturbed GCN, are sliced into two subviews by the user- and item-side respectively, and selectively combined into subview pairs bilaterally based on the characteristics of inter-node relation structure. Furthermore, to align the distribution of user and item embeddings after aggregation, a dispersing loss is leveraged to adjust the mutual distance between all embeddings for maintaining learning ability. Comprehensive experiments on two public datasets have proved the superiority of BusGCL in comparison to various recommendation methods. Other models can simply utilize our bilateral slicing contrastive learning to enhance recommending performance without incurring extra expenses.
☆ Spectral Initialization for High-Dimensional Phase Retrieval with Biased Spatial Directions
We explore a spectral initialization method that plays a central role in contemporary research on signal estimation in nonconvex scenarios. In a noiseless phase retrieval framework, we precisely analyze the method's performance in the high-dimensional limit when sensing vectors follow a multivariate Gaussian distribution for two rotationally invariant models of the covariance matrix C. In the first model C is a projector on a lower dimensional space while in the second it is a Wishart matrix. Our analytical results extend the well-established case when C is the identity matrix. Our examination shows that the introduction of biased spatial directions leads to a substantial improvement in the spectral method's effectiveness, particularly when the number of measurements is less than the signal's dimension. This extension also consistently reveals a phase transition phenomenon dependent on the ratio between sample size and signal dimension. Surprisingly, both of these models share the same threshold value.
comment: 13 pages, 4 figures
☆ GTC: GNN-Transformer Co-contrastive Learning for Self-supervised Heterogeneous Graph Representation
Graph Neural Networks (GNNs) have emerged as the most powerful weapon for various graph tasks due to the message-passing mechanism's great local information aggregation ability. However, over-smoothing has always hindered GNNs from going deeper and capturing multi-hop neighbors. Unlike GNNs, Transformers can model global information and multi-hop interactions via multi-head self-attention and a proper Transformer structure can show more immunity to the over-smoothing problem. So, can we propose a novel framework to combine GNN and Transformer, integrating both GNN's local information aggregation and Transformer's global information modeling ability to eliminate the over-smoothing problem? To realize this, this paper proposes a collaborative learning scheme for GNN-Transformer and constructs GTC architecture. GTC leverages the GNN and Transformer branch to encode node information from different views respectively, and establishes contrastive learning tasks based on the encoded cross-view information to realize self-supervised heterogeneous graph representation. For the Transformer branch, we propose Metapath-aware Hop2Token and CG-Hetphormer, which can cooperate with GNN to attentively encode neighborhood information from different levels. As far as we know, this is the first attempt in the field of graph representation learning to utilize both GNN and Transformer to collaboratively capture different view information and conduct cross-view contrastive learning. The experiments on real datasets show that GTC exhibits superior performance compared with state-of-the-art methods. Codes can be available at https://github.com/PHD-lanyu/GTC.
☆ LimGen: Probing the LLMs for Generating Suggestive Limitations of Research Papers
Examining limitations is a crucial step in the scholarly research reviewing process, revealing aspects where a study might lack decisiveness or require enhancement. This aids readers in considering broader implications for further research. In this article, we present a novel and challenging task of Suggestive Limitation Generation (SLG) for research papers. We compile a dataset called LimGen, encompassing 4068 research papers and their associated limitations from the ACL anthology. We investigate several approaches to harness large language models (LLMs) for producing suggestive limitations, by thoroughly examining the related challenges, practical insights, and potential opportunities. Our LimGen dataset and code can be accessed at https://github.com/armbf/LimGen.
comment: 16 pages, 3 figures
♻ ☆ Language Modeling for Content-enriched Recommendation
Recommender systems are indispensable in the realm of online applications, and sequential recommendation has enjoyed considerable prevalence due to its capacity to encapsulate the dynamic shifts in user interests. However, previous sequential modeling methods still have limitations in capturing contextual information. The primary reason is the lack of understanding of domain-specific knowledge and item-related textual content by language models. Fortunately, the emergence of powerful language models has unlocked the potential to incorporate extensive world knowledge into recommendation algorithms, enabling them to go beyond simple item attributes and truly understand the world surrounding user preferences. To achieve this, we propose LANCER, which leverages the semantic understanding capabilities of pre-trained language models to generate personalized recommendations. Our approach bridges the gap between language models and recommender systems, resulting in more human-like recommendations. We demonstrate the effectiveness of our approach through a series of experiments conducted on multiple benchmark datasets, showing promising results and providing valuable insights into the influence of our model on sequential recommendation tasks. Furthermore, our experimental codes are publicly available.
♻ ☆ On Image Search in Histopathology
Pathology images of histopathology can be acquired from camera-mounted microscopes or whole slide scanners. Utilizing similarity calculations to match patients based on these images holds significant potential in research and clinical contexts. Recent advancements in search technologies allow for implicit quantification of tissue morphology across diverse primary sites, facilitating comparisons and enabling inferences about diagnosis, and potentially prognosis, and predictions for new patients when compared against a curated database of diagnosed and treated cases. In this paper, we comprehensively review the latest developments in image search technologies for histopathology, offering a concise overview tailored for computational pathology researchers seeking effective, fast and efficient image search methods in their work.
comment: A chapter in the Book "Artificial INtelligence in Digital Pathology" by Cohen and Chauhan, 2024
♻ ☆ Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation NAACL 2024
Synthetic users are cost-effective proxies for real users in the evaluation of conversational recommender systems. Large language models show promise in simulating human-like behavior, raising the question of their ability to represent a diverse population of users. We introduce a new protocol to measure the degree to which language models can accurately emulate human behavior in conversational recommendation. This protocol is comprised of five tasks, each designed to evaluate a key property that a synthetic user should exhibit: choosing which items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Through evaluation of baseline simulators, we demonstrate these tasks effectively reveal deviations of language models from human behavior, and offer insights on how to reduce the deviations with model selection and prompting strategies.
comment: NAACL 2024
Data Structures and Algorithms
☆ Fourier Transform-based Estimators for Data Sketches
In this paper we consider the problem of estimating the $f$-moment ($\sum_{v\in [n]} (f(\mathbf{x}(v))-f(0))$) of a dynamic vector $\mathbf{x}\in \mathbb{G}^n$ over some abelian group $(\mathbb{G},+)$, where the $\|f\|_\infty$ norm is bounded. We propose a simple sketch and new estimation framework based on the \emph{Fourier transform} of $f$. I.e., we decompose $f$ into a linear combination of homomorphisms $f_1,f_2,\ldots$ from $(\mathbb{G},+)$ to $(\mathbb{C},\times)$, estimate the $f_k$-moment for each $f_k$, and synthesize them to obtain an estimate of the $f$-moment. Our estimators are asymptotically unbiased and have variance asymptotic to $\|\mathbf{x}\|_0^2 (\|f\|_\infty^2 m^{-1} + \|\hat{f}\|_1^2 m^{-2})$, where the size of the sketch is $O(m\log n\log|\mathbb{G}|)$ bits. When $\mathbb{G}=\mathbb{Z}$ this problem can also be solved using off-the-shelf $\ell_0$-samplers with space $O(m\log^2 n)$ bits, which does not obviously generalize to finite groups. As a concrete benchmark, we extend Ganguly, Garofalakis, and Rastogi's singleton-detector-based sampler to work over $\mathbb{G}$ using $O(m\log n\log|\mathbb{G}|\log(m\log n))$ bits. We give some experimental evidence that the Fourier-based estimation framework is significantly more accurate than sampling-based approaches at the same memory footprint.
☆ Differentially Private Ad Conversion Measurement
In this work, we study ad conversion measurement, a central functionality in digital advertising, where an advertiser seeks to estimate advertiser website (or mobile app) conversions attributed to ad impressions that users have interacted with on various publisher websites (or mobile apps). Using differential privacy (DP), a notion that has gained in popularity due to its strong mathematical guarantees, we develop a formal framework for private ad conversion measurement. In particular, we define the notion of an operationally valid configuration of the attribution rule, DP adjacency relation, contribution bounding scope and enforcement point. We then provide, for the set of configurations that most commonly arises in practice, a complete characterization, which uncovers a delicate interplay between attribution and privacy.
comment: To appear in PoPETS 2024
☆ Approximation Algorithms for School Assignment: Group Fairness and Multi-criteria Optimization
We consider the problem of assigning students to schools, when students have different utilities for schools and schools have capacity. There are additional group fairness considerations over students that can be captured either by concave objectives, or additional constraints on the groups. We present approximation algorithms for this problem via convex program rounding that achieve various trade-offs between utility violation, capacity violation, and running time. We also show that our techniques easily extend to the setting where there are arbitrary covering constraints on the feasible assignment, capturing multi-criteria and ranking optimization.
☆ Approximation Algorithms for Network Design in Non-Uniform Fault Models
The Survivable Network Design problem (SNDP) is a well-studied problem, motivated by the design of networks that are robust to faults under the assumption that any subset of edges up to a specific number can fail. We consider non-uniform fault models where the subset of edges that fail can be specified in different ways. Our primary interest is in the flexible graph connectivity model, in which the edge set is partitioned into safe and unsafe edges. The goal is to design a network that has desired connectivity properties under the assumption that only unsafe edges up to a specific number can fail. We also discuss the bulk-robust model and the relative survivable network design model. While SNDP admits a 2-approximation, the approximability of problems in these more complex models is much less understood even in special cases. We make two contributions. Our first set of results are in the flexible graph connectivity model. Motivated by a conjecture that a constant factor approximation is feasible when the robustness parameters are fixed constants, we consider two important special cases, namely the single pair case, and the global connectivity case. For both these, we obtain constant factor approximations in several parameter ranges of interest. These are based on an augmentation framework and via decomposing the families of cuts that need to be covered into a small number of uncrossable families. Our second set of results are poly-logarithmic approximations for the bulk-robust model when the "width" of the given instance (the maximum number of edges that can fail in any particular scenario) is fixed. Via this, we derive corresponding approximations for the flexible graph connectivity model and the relative survivable network design model. The results are obtained via two algorithmic approaches and they have different tradeoffs in terms of the approximation ratio and generality.
comment: A preliminary version of this paper appeared in the Proc. of ICALP 2023 (10.4230/LIPIcs.ICALP.2023.36), which combines and extends results from two earlier versions: arXiv:2209.12273 (for the first set of results) and arXiv:2211.08324 (for the second set of results)
♻ ☆ Sampling Proper Colorings on Line Graphs Using $(1+o(1))Δ$ Colors
We prove that the single-site Glauber dynamics for sampling proper $q$-colorings mixes in $O_\Delta(n\log n)$ time on line graphs with $n$ vertices and maximum degree $\Delta$ when $q>(1+o(1))\Delta$. The main tool in our proof is the matrix trickle-down theorem developed by Abdolazimi, Liu and Oveis Gharan (FOCS, 2021).
comment: 28 pages
♻ ☆ Fully-Dynamic All-Pairs Shortest Paths: Likely Optimal Worst-Case Update Time
The All-Pairs Shortest Paths (APSP) problem is one of the fundamental problems in theoretical computer science. It asks to compute the distance matrix of a given $n$-vertex graph. We revisit the classical problem of maintaining the distance matrix under a fully dynamic setting undergoing vertex insertions and deletions with a fast worst-case running time and efficient space usage. Although an algorithm with amortized update-time $\tilde O(n ^ 2)$ has been known for nearly two decades [Demetrescu and Italiano, STOC 2003], the current best algorithm for worst-case running time with efficient space usage runs is due to [Gutenberg and Wulff-Nilsen, SODA 2020], which improves the space usage of the previous algorithm due to [Abraham, Chechik, and Krinninger, SODA 2017] to $\tilde O(n ^ 2)$ but fails to improve their running time of $\tilde O(n ^ {2 + 2 / 3})$. It has been conjectured that no algorithm in $O(n ^ {2.5 - \epsilon})$ worst-case update time exists. For graphs without negative cycles, we meet this conjectured lower bound by introducing a Monte Carlo algorithm running in randomized $\tilde O(n ^ {2.5})$ time while keeping the $\tilde O(n ^ 2)$ space bound from the previous algorithm. Our breakthrough is made possible by the idea of ``hop-dominant shortest paths,'' which are shortest paths with a constraint on hops (number of vertices) that remain shortest after we relax the constraint by a constant factor.
comment: Preprint
♻ ☆ Fully Dynamic Exact Edge Connectivity in Sublinear Time
Given a simple $n$-vertex, $m$-edge graph $G$ undergoing edge insertions and deletions, we give two new fully dynamic algorithms for exactly maintaining the edge connectivity of $G$ in $\tilde{O}(n)$ worst-case update time and $\tilde{O}(m^{1-1/31})$ amortized update time, respectively. Prior to our work, all dynamic edge connectivity algorithms either assumed bounded edge connectivity, guaranteed approximate solutions, or were restricted to edge insertions only. Our results provide an affirmative answer to an open question posed by Thorup [Combinatorica'07].
comment: corrected the runtime of the algorithm based on expander decompositions
♻ ☆ Quantum Langevin Dynamics for Optimization
We initiate the study of utilizing Quantum Langevin Dynamics (QLD) to solve optimization problems, particularly those non-convex objective functions that present substantial obstacles for traditional gradient descent algorithms. Specifically, we examine the dynamics of a system coupled with an infinite heat bath. This interaction induces both random quantum noise and a deterministic damping effect to the system, which nudge the system towards a steady state that hovers near the global minimum of objective functions. We theoretically prove the convergence of QLD in convex landscapes, demonstrating that the average energy of the system can approach zero in the low temperature limit with an exponential decay rate correlated with the evolution time. Numerically, we first show the energy dissipation capability of QLD by retracing its origins to spontaneous emission. Furthermore, we conduct detailed discussion of the impact of each parameter. Finally, based on the observations when comparing QLD with classical Fokker-Plank-Smoluchowski equation, we propose a time-dependent QLD by making temperature and $\hbar$ time-dependent parameters, which can be theoretically proven to converge better than the time-independent case and also outperforms a series of state-of-the-art quantum and classical optimization algorithms in many non-convex landscapes.
comment: 52 pages, 1 table, 25 figures
♻ ☆ Dually conformal hypergraphs
Given a hypergraph $\mathcal{H}$, the dual hypergraph of $\mathcal{H}$ is the hypergraph of all minimal transversals of $\mathcal{H}$. The dual hypergraph is always Sperner, that is, no hyperedge contains another. A special case of Sperner hypergraphs are the conformal Sperner hypergraphs, which correspond to the families of maximal cliques of graphs. All these notions play an important role in many fields of mathematics and computer science, including combinatorics, algebra, database theory, etc. In this paper we study conformality of dual hypergraphs and prove several results related to the problem of recognizing this property. In particular, we show that the problem is in co-NP and can be solved in polynomial time for hypergraphs of bounded dimension. In the special case of dimension $3$, we reduce the problem to $2$-Satisfiability. Our approach has an implication in algorithmic graph theory: we obtain a polynomial-time algorithm for recognizing graphs in which all minimal transversals of maximal cliques have size at most $k$, for any fixed $k$.
♻ ☆ Flip-width: Cops and Robber on dense graphs
We define new graph parameters, called flip-width, that generalize treewidth, degeneracy, and generalized coloring numbers for sparse graphs, and clique-width and twin-width for dense graphs. The flip-width parameters are defined using variants of the Cops and Robber game, in which the robber has speed bounded by a fixed constant $r\in\mathbb N\cup\{\infty\}$, and the cops perform flips (or perturbations) of the considered graph. We then propose a new notion of tameness of a graph class, called bounded flip-width, which is a dense counterpart of classes of bounded expansion of Ne\v{s}etril and Ossona de Mendez, and includes classes of bounded twin-width of Bonnet, Kim, Thomass{\'e}, and Watrigant. This unifies Sparsity Theory and Twin-width Theory, providing a common language for studying the central notions of the two theories, such as weak coloring numbers and twin-width -- corresponding to winning strategies of one player -- or dense shallow minors, rich divisions, or well-linked sets, corresponding to winning strategies of the other player. We prove that boundedness of flip-width is preserved by first-order interpretations, or transductions, generalizing previous results concerning classes of bounded expansion and bounded twin-width. We provide an algorithm approximating the flip-width of a given graph, which runs in slicewise polynomial time (XP) in the size of the graph. Finally, we propose a more general notion of tameness, called almost bounded flip-width, which is a dense counterpart of nowhere dense classes. We conjecture, and provide evidence, that classes with almost bounded flip-width coincide with monadically dependent (or monadically NIP) classes, introduced by Shelah in model theory. We also provide evidence that classes of almost bounded flip-width characterise the hereditary graph classes for which the model-checking problem is fixed-parameter tractable.
comment: 80 pages
Signal Processing
☆ Information Rates of Successive Interference Cancellation for Optical Fiber
Successive interference cancellation (SIC) is used to approach the achievable information rates (AIRs) of joint detection and decoding for long-haul optical fiber links. The AIRs of memoryless ring constellations are compared to those of circularly symmetric complex Gaussian modulation for surrogate channel models with correlated phase noise. Simulations are performed for 1000 km of standard single-mode fiber with ideal Raman amplification. In this setup, 32 rings and 16 SIC-stages with Gaussian message-passing receivers achieve the AIR peaks of previous work. The computational complexity scales in proportion to the number of SIC-stages, where one stage has the complexity of separate detection and decoding.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Exploring the Task-agnostic Trait of Self-supervised Learning in the Context of Detecting Mental Disorders
Self-supervised learning (SSL) has been investigated to generate task-agnostic representations across various domains. However, such investigation has not been conducted for detecting multiple mental disorders. The rationale behind the existence of a task-agnostic representation lies in the overlapping symptoms among multiple mental disorders. Consequently, the behavioural data collected for mental health assessment may carry a mixed bag of attributes related to multiple disorders. Motivated by that, in this study, we explore a task-agnostic representation derived through SSL in the context of detecting major depressive disorder (MDD) and post-traumatic stress disorder (PTSD) using audio and video data collected during interactive sessions. This study employs SSL models trained by predicting multiple fixed targets or masked frames. We propose a list of fixed targets to make the generated representation more efficient for detecting MDD and PTSD. Furthermore, we modify the hyper-parameters of the SSL encoder predicting fixed targets to generate global representations that capture varying temporal contexts. Both these innovations are noted to yield improved detection performances for considered mental disorders and exhibit task-agnostic traits. In the context of the SSL model predicting masked frames, the generated global representations are also noted to exhibit task-agnostic traits.
☆ Channel Orthogonalization with Reconfigurable Surfaces: General Models, Theoretical Limits, and Effective Configuration
We envision a future in which multi-antenna technology effectively exploits the spatial domain as a set of non-interfering orthogonal resources, allowing for flexible resource allocation and efficient modulation/demodulation. Reconfigurable intelligent surface (RIS) has emerged as a promising technology which allows shaping the propagation environment for improved performance. This paper studies the ability of three extended types of reconfigurable surface (RS), including the recently proposed beyond diagonal RIS (BD-RIS), to achieve perfectly orthogonal channels in a general multi-user multiple-input multiple-output (MU-MIMO) scenario. We propose practical implementations for the three types of RS consisting of passive components, and obtain the corresponding restrictions on their reconfigurability. We then use these restrictions to derive closed-form conditions for achieving arbitrary (orthogonal) channels. We also study the problem of optimal orthogonal channel selection for achieving high channel gain without active amplification at the RS, and we propose some methods with satisfying performance. Finally, we provide efficient channel estimation and RS configuration techniques such that all the computation, including the channel selection, may be performed at the base station (BS). The numerical results showcase the potential and practicality of RS channel orthogonalization, thus taking a step towards orthogonal spatial domain multiplexing (OSDM).
comment: 13 pages, 4 figures. This work has been submitted to the IEEE for possible publication, copyright information may be affected upon publication
☆ Robust Resource Allocation for STAR-RIS Assisted SWIPT Systems
A simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) assisted simultaneous wireless information and power transfer (SWIPT) system is proposed. More particularly, an STAR-RIS is deployed to assist in the information/power transfer from a multi-antenna access point (AP) to multiple single-antenna information users (IUs) and energy users (EUs), where two practical STAR-RIS operating protocols, namely energy splitting (ES) and time switching (TS), are employed. Under the imperfect channel state information (CSI) condition, a multi-objective optimization problem (MOOP) framework, that simultaneously maximizes the minimum data rate and minimum harvested power, is employed to investigate the fundamental rate-energy trade-off between IUs and EUs. To obtain the optimal robust resource allocation strategy, the MOOP is first transformed into a single-objective optimization problem (SOOP) via the {\epsilon}-constraint method, which is then reformulated by approximating semi-infinite inequality constraints with the S-procedure. For ES, an alternating optimization (AO)-based algorithm is proposed to jointly design AP active beamforming and STAR-RIS passive beamforming, where a penalty method is leveraged in STAR-RIS beamforming design. Furthermore, the developed algorithm is extended to optimize the time allocation policy and beamforming vectors in a two-layer iterative manner for TS. Numerical results reveal that: 1) deploying STAR-RISs achieves a significant performance gain over conventional RISs, especially in terms of harvested power for EUs; 2) the ES protocol obtains a better user fairness performance when focusing only on IUs or EUs, while the TS protocol yields a better balance between IUs and EUs; 3) the imperfect CSI affects IUs more significantly than EUs, whereas TS can confer a more robust design to attenuate these effects.
☆ Uplink soft handover for LEO constellations: how strong the inter-satellite link should be
We consider a constellation of low-earth-orbit (LEO) satellites connected to a handheld device on the ground. Due to the very large orbital speed, an effective handover strategy becomes of paramount importance. In particular, we study the benefits of soft handover in the uplink from the physical-layer point of view. We give a realistic model for both the ground-to-satellite and the inter-satellite links, following the 3GPP channel model for the former. We suppose that, during handover from a serving satellite to a target satellite, one of the two satellites forwards the received signal from the ground user to the other, thus acting as a relay. We quantify through simulations the loss of hard handover, compared to soft handover. For the latter, we test both amplify-and-forward (AF) and decode-and-forward (DF) relaying techniques and verify that, at least in the simulated conditions, DF does not repay, in terms of block error rate (BLER), the increase of complexity with respect to AF. Also, we study the effect of the LEO constellation size on the network BLER. Finally, we show that, with soft handover, the impact of misalignment on the inter-satellite link is severe, especially at optical frequencies.
☆ Coexisting Passive RIS and Active Relay Assisted NOMA Systems
A novel coexisting passive reconfigurable intelligent surface (RIS) and active decode-and-forward (DF) relay assisted non-orthogonal multiple access (NOMA) transmission framework is proposed. In particular, two communication protocols are conceived, namely Hybrid NOMA (H-NOMA) and Full NOMA (F-NOMA). Based on the proposed two protocols, both the sum rate maximization and max-min rate fairness problems are formulated for jointly optimizing the power allocation at the access point and relay as well as the passive beamforming design at the RIS. To tackle the non-convex problems, an alternating optimization (AO) based algorithm is first developed, where the transmit power and the RIS phase-shift are alternatingly optimized by leveraging the two-dimensional search and rank-relaxed difference-of-convex (DC) programming, respectively. Then, a two-layer penalty based joint optimization (JO) algorithm is developed to jointly optimize the resource allocation coefficients within each iteration. Finally, numerical results demonstrate that: i) the proposed coexisting RIS and relay assisted transmission framework is capable of achieving a significant user performance improvement than conventional schemes without RIS or relay; ii) compared with the AO algorithm, the JO algorithm requires less execution time at the cost of a slight performance loss; and iii) the H-NOMA and F-NOMA protocols are generally preferable for ensuring user rate fairness and enhancing user sum rate, respectively.
☆ STAR-RIS Assisted Downlink Active and Uplink Backscatter Communications with NOMA
A simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) assisted downlink (DL) active and uplink (UL) backscatter communication (BackCom) framework is proposed. More particularly, a full-duplex (FD) base station (BS) communicates with the DL users via the STAR-RIS's transmission link, while exciting and receiving the information from the UL BackCom devices with the aid of the STAR-RIS's reflection link. Non-orthogonal multiple access (NOMA) is exploited in both DL and UL communications for improving the spectrum efficiency. The system weighted sum rate maximization problem is formulated for jointly optimizing the FD BS active receive and transmit beamforming, the STAR- RIS passive beamforming, and the DL NOMA decoding orders, subject to the DL user's individual rate constraint. To tackle this challenging non-convex problem, we propose an alternating optimization (AO) based algorithm for the joint active and passive beamforming design with a given DL NOMA decoding order. To address the potential high computational complexity required for exhaustive searching all the NOMA decoding orders, an efficient NOMA user ordering scheme is further developed. Finally, numerical results demonstrate that: i) compared with the baseline schemes employing conventional RISs or space division multiple access, the proposed scheme achieves higher performance gains; and ii) higher UL rate gain is obtained at a cost of DL performance degradation, as a remedy, a more flexible performance tradeoff can be achieved by introducing the STAR-RIS.
☆ Range-Angle Estimation for FDA-MIMO System With Frequency Offset
Frequency diverse array multiple-input multiple-output (FDA-MIMO) radar differs from the traditional phased array (PA) radar, and can form range-angle-dependent beampattern and differentiate between closely spaced targets sharing the same angle but occupying distinct range cells. In the FDA-MIMO radar, target range estimation is achieved by employing a subtle frequency variation between adjacent array antennas, so the estimation performance is degraded severely in a practical scenario with frequency offset. In this paper, the range-angle estimation problem for FDA-MIMO radar is considered with frequency offsets in both transmitting and receiving arrays. First, we build a system model for the FDA-MIMO radar with transmitting and receiving frequency offsets. Then, the frequency offset is transferred into an equalized additional noise. The noise characteristics are analyzed in detail theoretically, together with the influence on the range-angle estimation. Moreover, since the effect of the transmitting frequency offset is similar to additional colored noise, denoising algorithms are introduced to mitigate the performance deterioration caused by the frequency offset. Finally, Cram\'{e}r-Rao lower bounds (CRLB) for the range-angle estimation are derived in the scenario with the frequency offsets. Simulation results show the analysis of frequency offset and the corresponding estimation performance using different algorithms.
☆ Primary Rate Maximization in Movable Antennas Empowered Symbiotic Radio Communications
In this paper, we propose a movable antenna (MA) empowered scheme for symbiotic radio (SR) communication systems. Specifically, multiple antennas at the primary transmitter (PT) can be flexibly moved to favorable locations to boost the channel conditions of the primary and secondary transmissions. The primary transmission is achieved by the active transmission from the PT to the primary user (PU), while the backscatter device (BD) takes a ride over the incident signal from the PT to passively send the secondary signal to the PU. Under this setup, we consider a primary rate maximization problem by jointly optimizing the transmit beamforming and the positions of MAs at the PT under a practical bit error rate constraint on the secondary transmission. Then, an alternating optimization framework with the utilization of the successive convex approximation, semi-definite processing and simulated annealing (SA) modified particle swarm optimization (SA-PSO) methods is proposed to find the solution of the transmit beamforming and MAs' positions. Finally, numerical results are provided to demonstrate the performance improvement provided by the proposed MA empowered scheme and the proposed algorithm.
comment: To appear in IEEE VTC-Spring 2024. 6 Pages,5 figures
☆ Secure Outage Analysis for RIS-Aided MISO Systems with Randomly Located Eavesdroppers
In this paper, we consider the physical layer security of an RIS-assisted multiple-antenna communication system with randomly located eavesdroppers. The exact distributions of the received signal-to-noise-ratios (SNRs) at the legitimate user and the eavesdroppers located according to a Poisson point process (PPP) are derived, and a closed-form expression for the secrecy outage probability (SOP) is obtained. It is revealed that the secrecy performance is mainly affected by the number of RIS reflecting elements, and the impact of the transmit antennas and transmit power at the base station is marginal. In addition, when the locations of the randomly located eavesdroppers are unknown, deploying the RIS closer to the legitimate user rather than to the base station is shown to be more efficient. We also perform an analytical study demonstrating that the secrecy diversity order depends on the path loss exponent of the RIS-to-ground links. Finally, numerical simulations are conducted to verify the accuracy of these theoretical observations.
comment: Accepted by 2023 IEEE Globecom Workshops (GC Wkshps). arXiv admin note: substantial text overlap with arXiv:2312.16814
☆ Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation
In this article, we address the problem of federated learning in the presence of stragglers. For this problem, a coded federated learning framework has been proposed, where the central server aggregates gradients received from the non-stragglers and gradient computed from a privacy-preservation global coded dataset to mitigate the negative impact of the stragglers. However, when aggregating these gradients, fixed weights are consistently applied across iterations, neglecting the generation process of the global coded dataset and the dynamic nature of the trained model over iterations. This oversight may result in diminished learning performance. To overcome this drawback, we propose a new method named adaptive coded federated learning (ACFL). In ACFL, before the training, each device uploads a coded local dataset with additive noise to the central server to generate a global coded dataset under privacy preservation requirements. During each iteration of the training, the central server aggregates the gradients received from the non-stragglers and the gradient computed from the global coded dataset, where an adaptive policy for varying the aggregation weights is designed. Under this policy, we optimize the performance in terms of privacy and learning, where the learning performance is analyzed through convergence analysis and the privacy performance is characterized via mutual information differential privacy. Finally, we perform simulations to demonstrate the superiority of ACFL compared with the non-adaptive methods.
☆ Mitigating Subpopulation Bias for Fair Network Topology Inference
We consider fair network topology inference from nodal observations. Real-world networks often exhibit biased connections based on sensitive nodal attributes. Hence, different subpopulations of nodes may not share or receive information equitably. We thus propose an optimization-based approach to accurately infer networks while discouraging biased edges. To this end, we present bias metrics that measure topological demographic parity to be applied as convex penalties, suitable for most optimization-based graph learning methods. Moreover, we encourage equitable treatment for any number of subpopulations of differing sizes. We validate our method on synthetic and real-world simulations using networks with both biased and unbiased connections.
☆ RIS-assisted Cell-Free Massive MIMO Systems With Two-Timescale Design and Hardware Impairments
Integrating the reconfigurable intelligent surface (RIS) into a cell-free massive multiple-input multiple-output (CF-mMIMO) system is an effective solution to achieve high system capacity with low cost and power consumption. However, existing works of RIS-assisted systems mostly assumed perfect hardware, while the impact of hardware impairments (HWIs) is generally ignored. In this paper, we consider the general Rician fading channel and uplink transmission of the RIS-assisted CF-mMIMO system under transceiver impairments and RIS phase noise. To reduce the feedback overhead and power consumption, we propose a two-timescale transmission scheme to optimize the passive beamformers at RISs with statistical channel state information (CSI), while transmit beamformers at access points (APs) are designed based on instantaneous CSI. Also, the maximum ratio combining (MRC) detection is applied to the central processing unit (CPU). On this basis, we derive the closed-form approximate expression of the achievable rate, based on which the impact of HWIs and the power scaling laws are analyzed to draw useful theoretical insights. To maximize the users' sum rate or minimum rate, we first transform our rate expression into a tractable form, and then optimize the phase shifts of RISs based on an accelerated gradient ascent method. Finally, numerical results are presented to demonstrate the correctness of our derived expressions and validate the previous analysis, which provide some guidelines for the practical application of the imperfect RISs in the CF-mMIMO with transceiver HWIs.
comment: 51 pages, 11 figures
☆ Fast real-time arbitrary waveform generation using graphic processing units
Real-time Arbitrary Waveform Generation (AWG) is essential in various engineering and research applications, and often requires complex bespoke hardware and software. This paper introduces an AWG framework using an NVIDIA Graphics Processing Unit (GPU) and a commercially available high-speed Digital-to-Analog Converter (DAC) card, both running on a desktop personal computer (PC). The GPU accelerates the "embarrassingly" data parallel additive waveform synthesis framework for AWG, and the DAC reconstructs the generated waveform in the analog domain at high speed. The AWG framework is programmed using the developer-friendly Compute Unified Device Architecture (CUDA) runtime application programming interface from NVIDIA and is readily customizable, and scalable with additional parallel hardware. We present and characterize two different pathways for computing modulated radio-frequency (rf) waveforms: one pathway offers high-complexity simultaneous chirping of 1000 individual Nyquist-limited single-frequency tones for 35 ms at a sampling rate of 560 MB/s, and the other pathway allows simultaneous continuous chirping of 194 individual Nyquist-limited single-frequency tones at 100 MB/s, or 20 individual tones at 560 MB/s. This AWG framework is designed for fast on-the-fly rearrangement of a large stochastically-loaded optical tweezer array of single atoms or molecules into a defect-free array needed for quantum simulation and quantum computation applications.
comment: 13 pages, 10 figures
☆ Learning to walk on new ground: Calibration-free decoding for c-VEP BCI
This study explores two zero-training methods aimed at enhancing the usability of brain-computer interfaces (BCIs) by eliminating the need for a calibration session. We introduce a novel method rooted in the event-related potential (ERP) domain, unsupervised mean maximization (UMM), to the fast code-modulated visual evoked potential (c-VEP) stimulus protocol. We compare UMM to the state-of-the-art c-VEP zero-training method that uses canonical correlation analysis (CCA). The comparison includes instantaneous classification and classification with cumulative learning from previously classified trials for both CCA and UMM. Our study shows the effectiveness of both methods in navigating the complexities of a c-VEP dataset, highlighting their differences and distinct strengths. This research not only provides insights into the practical implementation of calibration-free BCI methods but also paves the way for further exploration and refinement. Ultimately, the fusion of CCA and UMM holds promise for enhancing the accessibility and usability of BCI systems across various application domains and a multitude of stimulus protocols.
comment: 6 pages, 2 figures, 9th Graz Brain-Computer Interface Conference 2024
♻ ☆ Large Language Model-informed ECG Dual Attention Network for Heart Failure Risk Prediction
Heart failure (HF) poses a significant public health challenge, with a rising global mortality rate. Early detection and prevention of HF could significantly reduce its impact. We introduce a novel methodology for predicting HF risk using 12-lead electrocardiograms (ECGs). We present a novel, lightweight dual-attention ECG network designed to capture complex ECG features essential for early HF risk prediction, despite the notable imbalance between low and high-risk groups. This network incorporates a cross-lead attention module and twelve lead-specific temporal attention modules, focusing on cross-lead interactions and each lead's local dynamics. To further alleviate model overfitting, we leverage a large language model (LLM) with a public ECG-Report dataset for pretraining on an ECG-report alignment task. The network is then fine-tuned for HF risk prediction using two specific cohorts from the UK Biobank study, focusing on patients with hypertension (UKB-HYP) and those who have had a myocardial infarction (UKB-MI).The results reveal that LLM-informed pre-training substantially enhances HF risk prediction in these cohorts. The dual-attention design not only improves interpretability but also predictive accuracy, outperforming existing competitive methods with C-index scores of 0.6349 for UKB-HYP and 0.5805 for UKB-MI. This demonstrates our method's potential in advancing HF risk assessment with clinical complex ECG data.
comment: Under journal revision
♻ ☆ Coverage and Rate Analysis for Integrated Sensing and Communication Networks
Integrated sensing and communication (ISAC) is increasingly recognized as a pivotal technology for next-generation cellular networks, offering mutual benefits in both sensing and communication capabilities. This advancement necessitates a re-examination of the fundamental limits within networks where these two functions coexist via shared spectrum and infrastructures. However, traditional stochastic geometry-based performance analyses are confined to either communication or sensing networks separately. This paper bridges this gap by introducing a generalized stochastic geometry framework in ISAC networks. Based on this framework, we define and calculate the coverage and ergodic rate of sensing and communication performance under resource constraints. Then, we shed light on the fundamental limits of ISAC networks by presenting theoretical results for the coverage rate of the unified performance, taking into account the coupling effects of dual functions in coexistence networks. Further, we obtain the analytical formulations for evaluating the ergodic sensing rate constrained by the maximum communication rate, and the ergodic communication rate constrained by the maximum sensing rate. Extensive numerical results validate the accuracy of all theoretical derivations, and also indicate that denser networks significantly enhance ISAC coverage. Specifically, increasing the base station density from $1$ $\text{km}^{-2}$ to $10$ $\text{km}^{-2}$ can boost the ISAC coverage rate from $1.4\%$ to $39.8\%$. Further, results also reveal that with the increase of the constrained sensing rate, the ergodic communication rate improves significantly, but the reverse is not obvious.
♻ ☆ Solution-Set Geometry and Regularization Path of a Nonconvexly Regularized Convex Sparse Model
The generalized minimax concave (GMC) penalty is a nonconvex sparse regularizer which can preserve the overall-convexity of the regularized least-squares problem. In this paper, we focus on a significant instance of the GMC model termed scaled GMC (sGMC), and present various notable findings on its solution-set geometry and regularization path. Our investigation indicates that while the sGMC penalty is a nonconvex extension of the LASSO penalty (i.e., the $\ell_1$-norm), the sGMC model preserves many celebrated properties of the LASSO model, hence can serve as a less biased surrogate of LASSO without losing its advantages. Specifically, for a fixed regularization parameter $\lambda$, we show that the solution-set geometry, solution uniqueness and sparseness of the sGMC model can be characterized in a similar elegant way to the LASSO model (see, e.g., Osborne et al. 2000, R. J. Tibshirani 2013). For a varying $\lambda$, we prove that the sGMC solution set is a continuous polytope-valued mapping of $\lambda$. Most noticeably, our study indicates that similar to LASSO, the minimum $\ell_2$-norm regularization path of the sGMC model is continuous and piecewise linear in $\lambda$. Based on these theoretical results, an efficient regularization path algorithm is proposed for the sGMC model, extending the well-known least angle regression (LARS) algorithm for LASSO. We prove the correctness and finite termination of the proposed algorithm under a mild assumption, and confirm its correctness-in-general-situation, efficiency, and practical utility through numerical experiments. Many results in this study also contribute to the theoretical research of LASSO.
comment: 53 pages, 10 figures. Submitted to journal
♻ ☆ Detection Is Tracking: Point Cloud Multi-Sweep Deep Learning Models Revisited
Conventional tracking paradigm takes in instantaneous measurements such as range and bearing, and produces object tracks across time. In applications such as autonomous driving, lidar measurements in the form of point clouds are usually passed through a "virtual sensor" realized by a deep learning model, to produce "measurements" such as bounding boxes, which are in turn ingested by a tracking module to produce object tracks. Very often multiple lidar sweeps are accumulated in a buffer to merge and become the input to the virtual sensor. We argue in this paper that such an input already contains temporal information, and therefore the virtual sensor output should also contain temporal information, not just instantaneous values for the time corresponding to the end of the buffer. In particular, we present the deep learning model called MULti-Sweep PAired Detector (MULSPAD) that produces, for each detected object, a pair of bounding boxes at both the end time and the beginning time of the input buffer. This is achieved with fairly straightforward changes in commonly used lidar detection models, and with only marginal extra processing, but the resulting symmetry is satisfying. Such paired detections make it possible not only to construct rudimentary trackers fairly easily, but also to construct more sophisticated trackers that can exploit the extra information conveyed by the pair and be robust to choices of motion models and object birth/death models. We have conducted preliminary training and experimentation using Waymo Open Dataset, which shows the efficacy of our proposed method.
comment: My previous employer Motional is requiring a review and approval process before I can publish this paper
♻ ☆ Learning-Based One-Bit Maximum Likelihood Detection for Massive MIMO Systems: Dithering-Aided Adaptive Approach
In this paper, we propose a learning-based detection framework for uplink massive multiple-input and multiple-output (MIMO) systems with one-bit analog-to-digital converters. The learning-based detection only requires counting the occurrences of the quantized outputs of -1 and +1 for estimating a likelihood probability at each antenna. Accordingly, the key advantage of this approach is to perform maximum likelihood detection without explicit channel estimation which has been one of the primary challenges of one-bit quantized systems. However, due to the quasi-deterministic reception in the high signal-to-noise ratio (SNR) regime, one-bit observations in the high SNR regime are biased to either +1 or -1, and thus, the learning requires excessive training to estimate the small likelihood probabilities. To address this drawback, we propose a dither-and-learning technique to estimate likelihood functions from dithered signals. First, we add a dithering signal to artificially decrease the SNR and then infer the likelihood function from the quantized dithered signals by using an SNR estimate derived from a deep neural network-based estimator which is trained offline. We extend our technique by developing an adaptive dither-and-learning method that updates the dithering power according to the patterns observed in the quantized dithered signals. The proposed framework is also applied to channel-coded MIMO systems by computing a bit-wise and user-wise log-likelihood ratio from the refined likelihood probabilities. Simulation results validate the performance of the proposed methods in both uncoded and coded systems.
comment: Accepted for publication in IEEE Transactions on Vehicular Technologies
♻ ☆ A Survey of Predictive Maintenance: Systems, Purposes and Approaches
This paper highlights the importance of maintenance techniques in the coming industrial revolution, reviews the evolution of maintenance techniques, and presents a comprehensive literature review on the latest advancement of maintenance techniques, i.e., Predictive Maintenance (PdM), with emphasis on system architectures, optimization objectives, and optimization methods. In industry, any outages and unplanned downtime of machines or systems would degrade or interrupt a company's core business, potentially resulting in significant penalties and immeasurable reputation and economic loss. Existing traditional maintenance approaches, such as Reactive Maintenance (RM) and Preventive Maintenance (PM), suffer from high prevent and repair costs, inadequate or inaccurate mathematical degradation processes, and manual feature extraction. The incoming fourth industrial revolution is also demanding for a new maintenance paradigm to reduce the maintenance cost and downtime, and increase system availability and reliability. Predictive Maintenance (PdM) is envisioned the solution. In this survey, we first provide a high-level view of the PdM system architectures including PdM 4.0, Open System Architecture for Condition Based Monitoring (OSA-CBM), and cloud-enhanced PdM system. Then, we review the specific optimization objectives, which mainly comprise cost minimization, availability/reliability maximization, and multi-objective optimization. Furthermore, we present the optimization methods to achieve the aforementioned objectives, which include traditional Machine Learning (ML) based and Deep Learning (DL) based approaches. Finally, we highlight the future research directions that are critical to promote the application of DL techniques in the context of PdM.
comment: 38 pages, 23 figures
♻ ☆ On Secrecy Performance of RIS-Assisted MISO Systems over Rician Channels with Spatially Random Eavesdroppers
Reconfigurable intelligent surface (RIS) technology is emerging as a promising technique for performance enhancement for next-generation wireless networks. This paper investigates the physical layer security of an RIS-assisted multiple-antenna communication system in the presence of random spatially distributed eavesdroppers. The RIS-to-ground channels are assumed to experience Rician fading. Using stochastic geometry, exact distributions of the received signal-to-noise-ratios (SNRs) at the legitimate user and the eavesdroppers located according to a Poisson point process (PPP) are derived, and closed-form expressions for the secrecy outage probability (SOP) and the ergodic secrecy capacity (ESC) are obtained to provide insightful guidelines for system design. First, the secrecy diversity order is obtained as $\frac{2}{\alpha_2}$, where $\alpha_2$ denotes the path loss exponent of the RIS-to-ground links. Then, it is revealed that the secrecy performance is mainly affected by the number of RIS reflecting elements, $N$, and the impact of the number of transmit antennas and transmit power at the base station is marginal. In addition, when the locations of the randomly located eavesdroppers are unknown, deploying the RIS closer to the legitimate user rather than to the base station is shown to be more efficient. Moreover, it is also found that the density of randomly located eavesdroppers, $\lambda_e$, has an additive effect on the asymptotic ESC performance given by $\log_2{\left({1}/{\lambda_e}\right)}$. Finally, numerical simulations are conducted to verify the accuracy of these theoretical observations.
comment: Accepted by IEEE Transactions on Wireless Communications
♻ ☆ Multiscale Hodge Scattering Networks for Data Analysis
We propose new scattering networks for signals measured on simplicial complexes, which we call \emph{Multiscale Hodge Scattering Networks} (MHSNs). Our construction is based on multiscale basis dictionaries on simplicial complexes, i.e., the $\kappa$-GHWT and $\kappa$-HGLET, which we recently developed for simplices of dimension $\kappa \in \mathbb{N}$ in a given simplicial complex by generalizing the node-based Generalized Haar-Walsh Transform (GHWT) and Hierarchical Graph Laplacian Eigen Transform (HGLET). The $\kappa$-GHWT and the $\kappa$-HGLET both form redundant sets (i.e., dictionaries) of multiscale basis vectors and the corresponding expansion coefficients of a given signal. Our MHSNs use a layered structure analogous to a convolutional neural network (CNN) to cascade the moments of the modulus of the dictionary coefficients. The resulting features are invariant to reordering of the simplices (i.e., node permutation of the underlying graphs). Importantly, the use of multiscale basis dictionaries in our MHSNs admits a natural pooling operation that is akin to local pooling in CNNs, and which may be performed either locally or per-scale. These pooling operations are harder to define in both traditional scattering networks based on Morlet wavelets, and geometric scattering networks based on Diffusion Wavelets. As a result, we are able to extract a rich set of descriptive yet robust features that can be used along with very simple machine learning methods (i.e., logistic regression or support vector machines) to achieve high-accuracy classification systems with far fewer parameters to train than most modern graph neural networks. Finally, we demonstrate the usefulness of our MHSNs in three distinct types of problems: signal classification, domain (i.e., graph/simplex) classification, and molecular dynamics prediction.
comment: 20 Pages, Comments Welcome
♻ ☆ Beamforming Design for Semantic-Bit Coexisting Communication System
Semantic communication (SemCom) is emerging as a key technology for future sixth-generation (6G) systems. Unlike traditional bit-level communication (BitCom), SemCom directly optimizes performance at the semantic level, leading to superior communication efficiency. Nevertheless, the task-oriented nature of SemCom renders it challenging to completely replace BitCom. Consequently, it is desired to consider a semantic-bit coexisting communication system, where a base station (BS) serves SemCom users (sem-users) and BitCom users (bit-users) simultaneously. Such a system faces severe and heterogeneous inter-user interference. In this context, this paper provides a new semantic-bit coexisting communication framework and proposes a spatial beamforming scheme to accommodate both types of users. Specifically, we consider maximizing the semantic rate for semantic users while ensuring the quality-of-service (QoS) requirements for bit-users. Due to the intractability of obtaining the exact closed-form expression of the semantic rate, a data driven method is first applied to attain an approximated expression via data fitting. With the resulting complex transcendental function, majorization minimization (MM) is adopted to convert the original formulated problem into a multiple-ratio problem, which allows fractional programming (FP) to be used to further transform the problem into an inhomogeneous quadratically constrained quadratic programs (QCQP) problem. Solving the problem leads to a semi-closed form solution with undetermined Lagrangian factors that can be updated by a fixed point algorithm. Extensive simulation results demonstrate that the proposed beamforming scheme significantly outperforms conventional beamforming algorithms such as zero-forcing (ZF), maximum ratio transmission (MRT), and weighted minimum mean-square error (WMMSE).
comment: Submitted to IEEE for possible publication
♻ ☆ User Training with Error Augmentation for Electromyogram-based Gesture Classification
We designed and tested a system for real-time control of a user interface by extracting surface electromyographic (sEMG) activity from eight electrodes in a wrist-band configuration. sEMG data were streamed into a machine-learning algorithm that classified hand gestures in real-time. After an initial model calibration, participants were presented with one of three types of feedback during a human-learning stage: veridical feedback, in which predicted probabilities from the gesture classification algorithm were displayed without alteration, modified feedback, in which we applied a hidden augmentation of error to these probabilities, and no feedback. User performance was then evaluated in a series of minigames, in which subjects were required to use eight gestures to manipulate their game avatar to complete a task. Experimental results indicated that, relative to baseline, the modified feedback condition led to significantly improved accuracy and improved gesture class separation. These findings suggest that real-time feedback in a gamified user interface with manipulation of feedback may enable intuitive, rapid, and accurate task acquisition for sEMG-based gesture recognition applications.
comment: 10 pages, 10 figures. V2: Fix latex characters in author name. V3: Add published DOI and Copyright notice
♻ ☆ Differentially Private Communication of Measurement Anomalies in the Smart Grid
In this paper, we present a framework based on differential privacy (DP) for querying electric power measurements to detect system anomalies or bad data. Our DP approach conceals consumption and system matrix data, while simultaneously enabling an untrusted third party to test hypotheses of anomalies, such as the presence of bad data, by releasing a randomized sufficient statistic for hypothesis-testing. We consider a measurement model corrupted by Gaussian noise and a sparse noise vector representing the attack, and we observe that the optimal test statistic is a chi-square random variable. To detect possible attacks, we propose a novel DP chi-square noise mechanism that ensures the test does not reveal private information about power injections or the system matrix. The proposed framework provides a robust solution for detecting bad data while preserving the privacy of sensitive power system data.
comment: 13 pages, 5 figures
♻ ☆ Decoupled Data Consistency with Diffusion Purification for Image Restoration
Diffusion models have recently gained traction as a powerful class of deep generative priors, excelling in a wide range of image restoration tasks due to their exceptional ability to model data distributions. To solve image restoration problems, many existing techniques achieve data consistency by incorporating additional likelihood gradient steps into the reverse sampling process of diffusion models. However, the additional gradient steps pose a challenge for real-world practical applications as they incur a large computational overhead, thereby increasing inference time. They also present additional difficulties when using accelerated diffusion model samplers, as the number of data consistency steps is limited by the number of reverse sampling steps. In this work, we propose a novel diffusion-based image restoration solver that addresses these issues by decoupling the reverse process from the data consistency steps. Our method involves alternating between a reconstruction phase to maintain data consistency and a refinement phase that enforces the prior via diffusion purification. Our approach demonstrates versatility, making it highly adaptable for efficient problem-solving in latent space. Additionally, it reduces the necessity for numerous sampling steps through the integration of consistency models. The efficacy of our approach is validated through comprehensive experiments across various image restoration tasks, including image denoising, deblurring, inpainting, and super-resolution.
Sound
☆ Music to Dance as Language Translation using Sequence Models
Synthesising appropriate choreographies from music remains an open problem. We introduce MDLT, a novel approach that frames the choreography generation problem as a translation task. Our method leverages an existing data set to learn to translate sequences of audio into corresponding dance poses. We present two variants of MDLT: one utilising the Transformer architecture and the other employing the Mamba architecture. We train our method on AIST++ and PhantomDance data sets to teach a robotic arm to dance, but our method can be applied to a full humanoid robot. Evaluation metrics, including Average Joint Error and Frechet Inception Distance, consistently demonstrate that, when given a piece of music, MDLT excels at producing realistic and high-quality choreography. The code can be found at github.com/meowatthemoon/MDLT.
☆ Towards auditory attention decoding with noise-tagging: A pilot study
Auditory attention decoding (AAD) aims to extract from brain activity the attended speaker amidst candidate speakers, offering promising applications for neuro-steered hearing devices and brain-computer interfacing. This pilot study makes a first step towards AAD using the noise-tagging stimulus protocol, which evokes reliable code-modulated evoked potentials, but is minimally explored in the auditory modality. Participants were sequentially presented with two Dutch speech stimuli that were amplitude modulated with a unique binary pseudo-random noise-code, effectively tagging these with additional decodable information. We compared the decoding of unmodulated audio against audio modulated with various modulation depths, and a conventional AAD method against a standard method to decode noise-codes. Our pilot study revealed higher performances for the conventional method with 70 to 100 percent modulation depths compared to unmodulated audio. The noise-code decoder did not further improve these results. These fundamental insights highlight the potential of integrating noise-codes in speech to enhance auditory speaker detection when multiple speakers are presented simultaneously.
comment: 6 pages, 2 figures, 9th Graz Brain-Computer Interface Conference 2024
♻ ☆ Unsupervised Multi-channel Separation and Adaptation
A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the multi-channel setting. We use MixIT to train a model on far-field microphone array recordings of overlapping reverberant and noisy speech from the AMI Corpus. The models are trained on both supervised and unsupervised training data, and are tested on real AMI recordings containing overlapping speech. To objectively evaluate our models, we also use a synthetic multi-channel AMI test set. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest improvement to SI-SNR and to human listening ratings across synthetic and real datasets, outperforming supervised models trained on well-matched synthetic data. Our results demonstrate that unsupervised learning through MixIT enables model adaptation on both single- and multi-channel real-world speech recordings.
♻ ☆ Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech ICASSP 2024
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. To initiate, Dynamic-SUPERB features 55 evaluation instances by combining 33 tasks and 22 datasets. This spans a broad spectrum of dimensions, providing a comprehensive platform for evaluation. Additionally, we propose several approaches to establish benchmark baselines. These include the utilization of speech models, text language models, and the multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together.
comment: To appear in the proceedings of ICASSP 2024
♻ ☆ MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition
Despite notable progress, speech emotion recognition (SER) remains challenging due to the intricate and ambiguous nature of speech emotion, particularly in wild world. While current studies primarily focus on recognition and generalization abilities, our research pioneers an investigation into the reliability of SER methods in the presence of semantic data shifts and explores how to exert fine-grained control over various attributes inherent in speech signals to enhance speech emotion modeling. In this paper, we first introduce MSAC-SERNet, a novel unified SER framework capable of simultaneously handling both single-corpus and cross-corpus SER. Specifically, concentrating exclusively on the speech emotion attribute, a novel CNN-based SER model is presented to extract discriminative emotional representations, guided by additive margin softmax loss. Considering information overlap between various speech attributes, we propose a novel learning paradigm based on correlations of different speech attributes, termed Multiple Speech Attribute Control (MSAC), which empowers the proposed SER model to simultaneously capture fine-grained emotion-related features while mitigating the negative impact of emotion-agnostic representations. Furthermore, we make a first attempt to examine the reliability of the MSAC-SERNet framework using out-of-distribution detection methods. Experiments on both single-corpus and cross-corpus SER scenarios indicate that MSAC-SERNet not only consistently outperforms the baseline in all aspects, but achieves superior performance compared to state-of-the-art SER approaches.
comment: 12 pages
♻ ☆ Modeling of Speech-dependent Own Voice Transfer Characteristics for Hearables with In-ear Microphones
Many hearables contain an in-ear microphone, which may be used to capture the own voice of its user. However, due to the hearable occluding the ear canal, the in-ear microphone mostly records body-conducted speech, typically suffering from band-limitation effects and amplification at low frequencies. Since the occlusion effect is determined by the ratio between the air-conducted and body-conducted components of own voice, the own voice transfer characteristics between the outer face of the hearable and the in-ear microphone depend on the speech content and the individual talker. In this paper, we propose a speech-dependent model of the own voice transfer characteristics based on phoneme recognition, assuming a linear time-invariant relative transfer function for each phoneme. We consider both individual models as well as models averaged over several talkers. Experimental results based on recordings with a prototype hearable show that the proposed speech-dependent model enables to simulate in-ear signals more accurately than a speech-independent model in terms of technical measures, especially under utterance mismatch and talker mismatch. Additionally, simulation results show that talker-averaged models generalize better to different talkers than individual models.
comment: 20 pages, 11 figures; Extended version of arXiv:2309.08294 (more detailed description of the problem, additional models considered, more systematic evaluation conducted on a different, larger dataset) -> Updated version (20th march 2024): major changes after internal review; in submission
♻ ☆ Unimodal Multi-Task Fusion for Emotional Mimicry Prediction
In this study, we propose a methodology for the Emotional Mimicry Intensity (EMI) Estimation task within the context of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our approach leverages the Wav2Vec 2.0 framework, pre-trained on a comprehensive podcast dataset, to extract a broad range of audio features encompassing both linguistic and paralinguistic elements. We enhance feature representation through a fusion technique that integrates individual features with a global mean vector, introducing global contextual insights into our analysis. Additionally, we incorporate a pre-trained valence-arousal-dominance (VAD) module from the Wav2Vec 2.0 model. Our fusion employs a Long Short-Term Memory (LSTM) architecture for efficient temporal analysis of audio data. Utilizing only the provided audio data, our approach demonstrates significant improvements over the established baseline.
♻ ☆ An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
Audio-visual approaches involving visual inputs have laid the foundation for recent progress in speech separation. However, the optimization of the concurrent usage of auditory and visual inputs is still an active research area. Inspired by the cortico-thalamo-cortical circuit, in which the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus, we propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS). First, the CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks, mimicking the functions of the auditory and visual cortical areas. Then, inspired by the large number of connections between cortical regions and the thalamus, the model fuses the auditory and visual information in a thalamic subnetwork through top-down connections. Finally, the model transmits this fused information back to the auditory and visual subnetworks, and the above process is repeated several times. The results of experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters. These results suggest that mimicking the anatomical connectome of the mammalian brain has great potential for advancing the development of deep neural networks. Project repo is https://github.com/JusperLee/CTCNet.
comment: Accepted by TPAMI 2024
Robotics
☆ Augmented Reality based Simulated Data (ARSim) with multi-view consistency for AV perception networks
Detecting a diverse range of objects under various driving scenarios is essential for the effectiveness of autonomous driving systems. However, the real-world data collected often lacks the necessary diversity presenting a long-tail distribution. Although synthetic data has been utilized to overcome this issue by generating virtual scenes, it faces hurdles such as a significant domain gap and the substantial efforts required from 3D artists to create realistic environments. To overcome these challenges, we present ARSim, a fully automated, comprehensive, modular framework designed to enhance real multi-view image data with 3D synthetic objects of interest. The proposed method integrates domain adaptation and randomization strategies to address covariate shift between real and simulated data by inferring essential domain attributes from real data and employing simulation-based randomization for other attributes. We construct a simplified virtual scene using real data and strategically place 3D synthetic assets within it. Illumination is achieved by estimating light distribution from multiple images capturing the surroundings of the vehicle. Camera parameters from real data are employed to render synthetic assets in each frame. The resulting augmented multi-view consistent dataset is used to train a multi-camera perception network for autonomous vehicles. Experimental results on various AV perception tasks demonstrate the superior performance of networks trained on the augmented dataset.
comment: 17 pages, 15 figures, 7 tables
☆ OceanPlan: Hierarchical Planning and Replanning for Natural Language AUV Piloting in Large-scale Unexplored Ocean Environments IROS 2024
We develop a hierarchical LLM-task-motion planning and replanning framework to efficiently ground an abstracted human command into tangible Autonomous Underwater Vehicle (AUV) control through enhanced representations of the world. We also incorporate a holistic replanner to provide real-world feedback with all planners for robust AUV operation. While there has been extensive research in bridging the gap between LLMs and robotic missions, they are unable to guarantee success of AUV applications in the vast and unknown ocean environment. To tackle specific challenges in marine robotics, we design a hierarchical planner to compose executable motion plans, which achieves planning efficiency and solution quality by decomposing long-horizon missions into sub-tasks. At the same time, real-time data stream is obtained by a replanner to address environmental uncertainties during plan execution. Experiments validate that our proposed framework delivers successful AUV performance of long-duration missions through natural language piloting.
comment: submitted to IROS 2024
☆ Safe and Stable Teleoperation of Quadrotor UAVs under Haptic Shared Autonomy
We present a novel approach that aims to address both safety and stability of a haptic teleoperation system within a framework of Haptic Shared Autonomy (HSA). We use Control Barrier Functions (CBFs) to generate the control input that follows the user's input as closely as possible while guaranteeing safety. In the context of stability of the human-in-the-loop system, we limit the force feedback perceived by the user via a small $L_2$-gain, which is achieved by limiting the control and the force feedback via a differential constraint. Specifically, with the property of HSA, we propose two pathways to design the control and the force feedback: Sequential Control Force (SCF) and Joint Control Force (JCF). Both designs can achieve safety and stability but with different responses to the user's commands. We conducted experimental simulations to evaluate and investigate the properties of the designed methods. We also tested the proposed method on a physical quadrotor UAV and a haptic interface.
☆ Gesture-Controlled Aerial Robot Formation for Human-Swarm Interaction in Safety Monitoring Applications
This paper presents a formation control approach for contactless gesture-based Human-Swarm Interaction (HSI) between a team of multi-rotor Unmanned Aerial Vehicles (UAVs) and a human worker. The approach is intended for monitoring the safety of human workers, especially those working at heights. In the proposed dynamic formation scheme, one UAV acts as the leader of the formation and is equipped with sensors for human worker detection and gesture recognition. The follower UAVs maintain a predetermined formation relative to the worker's position, thereby providing additional perspectives of the monitored scene. Hand gestures allow the human worker to specify movements and action commands for the UAV team and initiate other mission-related commands without the need for an additional communication channel or specific markers. Together with a novel unified human detection and tracking algorithm, human pose estimation approach and gesture detection pipeline, the proposed approach forms a first instance of an HSI system incorporating all these modules onboard real-world UAVs. Simulations and field experiments with three UAVs and a human worker in a mock-up scenario showcase the effectiveness and responsiveness of the proposed approach.
comment: 8 pages, 9 figures
☆ Introduction to Human-Robot Interaction: A Multi-Perspective Introductory Course
In this paper I describe the design of an introductory course in Human-Robot Interaction. This project-driven course is designed to introduce undergraduate and graduate engineering students, especially those enrolled in Computer Science, Mechanical Engineering, and Robotics degree programs, to key theories and methods used in the field of Human-Robot Interaction that they would otherwise be unlikely to see in those degree programs. To achieve this aim, the course takes students all the way from stakeholder analysis to empirical evaluation, covering and integrating key Qualitative, Design, Computational, and Quantitative methods along the way. I detail the goals, audience, and format of the course, and provide a detailed walkthrough of the course syllabus.
comment: Presented at the Designing an Intro to HRI Course Workshop at HRI 2024 (arXiv:2403.05588)
☆ HortiBot: An Adaptive Multi-Arm System for Robotic Horticulture of Sweet Peppers IROS
Horticultural tasks such as pruning and selective harvesting are labor intensive and horticultural staff are hard to find. Automating these tasks is challenging due to the semi-structured greenhouse workspaces, changing environmental conditions such as lighting, dense plant growth with many occlusions, and the need for gentle manipulation of non-rigid plant organs. In this work, we present the three-armed system HortiBot, with two arms for manipulation and a third arm as an articulated head for active perception using stereo cameras. Its perception system detects not only peppers, but also peduncles and stems in real time, and performs online data association to build a world model of pepper plants. Collision-aware online trajectory generation allows all three arms to safely track their respective targets for observation, grasping, and cutting. We integrated perception and manipulation to perform selective harvesting of peppers and evaluated the system in lab experiments. Using active perception coupled with end-effector force torque sensing for compliant manipulation, HortiBot achieves high success rates.
comment: Submitted to International Conference on Intelligent Robots and Systems (IROS) 2024. C. Lenz and R. Menon contributed equally
☆ Guided Decoding for Robot Motion Generation and Adaption
We address motion generation for high-DoF robot arms in complex settings with obstacles, via points, etc. A significant advancement in this domain is achieved by integrating Learning from Demonstration (LfD) into the motion generation process. This integration facilitates rapid adaptation to new tasks and optimizes the utilization of accumulated expertise by allowing robots to learn and generalize from demonstrated trajectories. We train a transformer architecture on a large dataset of simulated trajectories. This architecture, based on a conditional variational autoencoder transformer, learns essential motion generation skills and adapts these to meet auxiliary tasks and constraints. Our auto-regressive approach enables real-time integration of feedback from the physical system, enhancing the adaptability and efficiency of motion generation. We show that our model can generate motion from initial and target points, but also that it can adapt trajectories in navigating complex tasks, including obstacle avoidance, via points, and meeting velocity and acceleration constraints, across platforms.
comment: 7 pages
☆ TriHelper: Zero-Shot Object Navigation with Dynamic Assistance
Navigating toward specific objects in unknown environments without additional training, known as Zero-Shot object navigation, poses a significant challenge in the field of robotics, which demands high levels of auxiliary information and strategic planning. Traditional works have focused on holistic solutions, overlooking the specific challenges agents encounter during navigation such as collision, low exploration efficiency, and misidentification of targets. To address these challenges, our work proposes TriHelper, a novel framework designed to assist agents dynamically through three primary navigation challenges: collision, exploration, and detection. Specifically, our framework consists of three innovative components: (i) Collision Helper, (ii) Exploration Helper, and (iii) Detection Helper. These components work collaboratively to solve these challenges throughout the navigation process. Experiments on the Habitat-Matterport 3D (HM3D) and Gibson datasets demonstrate that TriHelper significantly outperforms all existing baseline methods in Zero-Shot object navigation, showcasing superior success rates and exploration efficiency. Our ablation studies further underscore the effectiveness of each helper in addressing their respective challenges, notably enhancing the agent's navigation capabilities. By proposing TriHelper, we offer a fresh perspective on advancing the object navigation task, paving the way for future research in the domain of Embodied AI and visual-based navigation.
comment: 8 pages, 5 figures
☆ DITTO: Demonstration Imitation by Trajectory Transformation IROS 2024
Teaching robots new skills quickly and conveniently is crucial for the broader adoption of robotic systems. In this work, we address the problem of one-shot imitation from a single human demonstration, given by an RGB-D video recording through a two-stage process. In the first stage which is offline, we extract the trajectory of the demonstration. This entails segmenting manipulated objects and determining their relative motion in relation to secondary objects such as containers. Subsequently, in the live online trajectory generation stage, we first \mbox{re-detect} all objects, then we warp the demonstration trajectory to the current scene, and finally, we trace the trajectory with the robot. To complete these steps, our method makes leverages several ancillary models, including those for segmentation, relative object pose estimation, and grasp prediction. We systematically evaluate different combinations of correspondence and re-detection methods to validate our design decision across a diverse range of tasks. Specifically, we collect demonstrations of ten different tasks including pick-and-place tasks as well as articulated object manipulation. Finally, we perform extensive evaluations on a real robot system to demonstrate the effectiveness and utility of our approach in real-world scenarios. We make the code publicly available at http://ditto.cs.uni-freiburg.de.
comment: 8 pages, 4 figures, 3 tables, submitted to IROS 2024
☆ CRPlace: Camera-Radar Fusion with BEV Representation for Place Recognition
The integration of complementary characteristics from camera and radar data has emerged as an effective approach in 3D object detection. However, such fusion-based methods remain unexplored for place recognition, an equally important task for autonomous systems. Given that place recognition relies on the similarity between a query scene and the corresponding candidate scene, the stationary background of a scene is expected to play a crucial role in the task. As such, current well-designed camera-radar fusion methods for 3D object detection can hardly take effect in place recognition because they mainly focus on dynamic foreground objects. In this paper, a background-attentive camera-radar fusion-based method, named CRPlace, is proposed to generate background-attentive global descriptors from multi-view images and radar point clouds for accurate place recognition. To extract stationary background features effectively, we design an adaptive module that generates the background-attentive mask by utilizing the camera BEV feature and radar dynamic points. With the guidance of a background mask, we devise a bidirectional cross-attention-based spatial fusion strategy to facilitate comprehensive spatial interaction between the background information of the camera BEV feature and the radar BEV feature. As the first camera-radar fusion-based place recognition network, CRPlace has been evaluated thoroughly on the nuScenes dataset. The results show that our algorithm outperforms a variety of baseline methods across a comprehensive set of metrics (recall@1 reaches 91.2%).
☆ AV-Occupant Perceived Risk Model for Cut-In Scenarios with Empirical Evaluation
Advancements in autonomous vehicle (AV) technologies necessitate precise estimation of perceived risk to enhance user comfort, acceptance and trust. This paper introduces a novel AV-Occupant Risk (AVOR) model designed for perceived risk estimation during AV cut-in scenarios. An empirical study is conducted with 18 participants with realistic cut-in scenarios. Two factors were investigated: scenario risk and scene population. 76% of subjective risk responses indicate an increase in perceived risk at cut-in initiation. The existing perceived risk model did not capture this critical phenomenon. Our AVOR model demonstrated a significant improvement in estimating perceived risk during the early stages of cut-ins, especially for the high-risk scenario, enhancing modelling accuracy by up to 54%. The concept of the AVOR model can quantify perceived risk in other diverse driving contexts characterized by dynamic uncertainties, enhancing the reliability and human-centred focus of AV systems.
☆ Infrastructure-Assisted Collaborative Perception in Automated Valet Parking: A Safety Perspective
Environmental perception in Automated Valet Parking (AVP) has been a challenging task due to severe occlusions in parking garages. Although Collaborative Perception (CP) can be applied to broaden the field of view of connected vehicles, the limited bandwidth of vehicular communications restricts its application. In this work, we propose a BEV feature-based CP network architecture for infrastructure-assisted AVP systems. The model takes the roadside camera and LiDAR as optional inputs and adaptively fuses them with onboard sensors in a unified BEV representation. Autoencoder and downsampling are applied for channel-wise and spatial-wise dimension reduction, while sparsification and quantization further compress the feature map with little loss in data precision. Combining these techniques, the size of a BEV feature map is effectively compressed to fit in the feasible data rate of the NR-V2X network. With the synthetic AVP dataset, we observe that CP can effectively increase perception performance, especially for pedestrians. Moreover, the advantage of infrastructure-assisted CP is demonstrated in two typical safety-critical scenarios in the AVP setting, increasing the maximum safe cruising speed by up to 3m/s in both scenarios.
comment: 7 pages, 7 figures, 4 tables, accepted by IEEE VTC2024-Spring
☆ RHINO-VR Experience: Teaching Mobile Robotics Concepts in an Interactive Museum Exhibit
In 1997, the very first tour guide robot RHINO was deployed in a museum in Germany. With the ability to navigate autonomously through the environment, the robot gave tours to over 2,000 visitors. Today, RHINO itself has become an exhibit and is no longer operational. In this paper, we present RHINO-VR, an interactive museum exhibit using virtual reality (VR) that allows museum visitors to experience the historical robot RHINO in operation in a virtual museum. RHINO-VR, unlike static exhibits, enables users to familiarize themselves with basic mobile robotics concepts without the fear of damaging the exhibit. In the virtual environment, the user is able to interact with RHINO in VR by pointing to a location to which the robot should navigate and observing the corresponding actions of the robot. To include other visitors who cannot use the VR, we provide an external observation view to make RHINO visible to them. We evaluated our system by measuring the frame rate of the VR simulation, comparing the generated virtual 3D models with the originals, and conducting a user study. The user-study showed that RHINO-VR improved the visitors' understanding of the robot's functionality and that they would recommend experiencing the VR exhibit to others.
comment: Submitted to IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN)
☆ ALPINE: a climbing robot for operations in mountain environments
Mountain slopes are perfect examples of harsh environments in which humans are required to perform difficult and dangerous operations such as removing unstable boulders, dangerous vegetation or deploying safety nets. A good replacement for human intervention can be offered by climbing robots. The different solutions existing in the literature are not up to the task for the difficulty of the requirements (navigation, heavy payloads, flexibility in the execution of the tasks). In this paper, we propose a robotic platform that can fill this gap. Our solution is based on a robot that hangs on ropes, and uses a retractable leg to jump away from the mountain walls. Our package of mechanical solutions, along with the algorithms developed for motion planning and control, delivers swift navigation on irregular and steep slopes, the possibility to overcome or travel around significant natural barriers, and the ability to carry heavy payloads and execute complex tasks. In the paper, we give a full account of our main design and algorithmic choices and show the feasibility of the solution through a large number of physically simulated scenarios.
☆ Collision Avoidance Safety Filter for an Autonomous E-Scooter using Ultrasonic Sensors
In this paper, we propose a collision avoidance safety filter for autonomous electric scooters to enable safe operation of such vehicles in pedestrian areas. In particular, we employ multiple low-cost ultrasonic sensors to detect a wide range of possible obstacles in front of the e-scooter. Based on possibly faulty distance measurements, we design a filter to mitigate measurement noise and missing values as well as a gain-scheduled controller to limit the velocity commanded to the e-scooter when required due to imminent collisions. The proposed controller structure is able to prevent collisions with unknown obstacles by deploying a reduced safe velocity ensuring a sufficiently large safety distance. The collision avoidance approach is designed such that it may be easily deployed in similar applications of general micromobility vehicles. The effectiveness of our proposed safety filter is demonstrated in real-world experiments.
☆ Set-membership target search and tracking within an unknown cluttered area using cooperating UAVs equipped with vision systems
This paper addresses the problem of target search and tracking using a fleet of cooperating UAVs evolving in some unknown region of interest containing an a priori unknown number of moving ground targets. Each drone is equipped with an embedded Computer Vision System (CVS), providing an image with labeled pixels and a depth map of the observed part of its environment. Moreover, a box containing the corresponding pixels in the image frame is available when a UAV identifies a target. Hypotheses regarding information provided by the pixel classification, depth map construction, and target identification algorithms are proposed to allow its exploitation by set-membership approaches. A set-membership target location estimator is developed using the information provided by the CVS. Each UAV evaluates sets guaranteed to contain the location of the identified targets and a set possibly containing the locations of targets still to be identified. Then, each UAV uses these sets to search and track targets cooperatively.
☆ PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation IROS2024
Humans seemingly incorporate potential touch signals in their perception. Our goal is to equip robots with a similar capability, which we term \ourmodel. \ourmodel aims to predict the expected touch signal based on a visual patch representing the touched area. We frame this problem as the task of learning a low-dimensional visual-tactile embedding, wherein we encode a depth patch from which we decode the tactile signal. To accomplish this task, we employ ReSkin, an inexpensive and replaceable magnetic-based tactile sensor. Using ReSkin, we collect and train PseudoTouch on a dataset comprising aligned tactile and visual data pairs obtained through random touching of eight basic geometric shapes. We demonstrate the efficacy of PseudoTouch through its application to two downstream tasks: object recognition and grasp stability prediction. In the object recognition task, we evaluate the learned embedding's performance on a set of five basic geometric shapes and five household objects. Using PseudoTouch, we achieve an object recognition accuracy 84% after just ten touches, surpassing a proprioception baseline. For the grasp stability task, we use ACRONYM labels to train and evaluate a grasp success predictor using PseudoTouch's predictions derived from virtual depth information. Our approach yields an impressive 32% absolute improvement in accuracy compared to the baseline relying on partial point cloud data. We make the data, code, and trained models publicly available at http://pseudotouch.cs.uni-freiburg.de.
comment: 8 pages, 7 figures, 2 tables, submitted to IROS2024
☆ Learning from Visual Demonstrations through Differentiable Nonlinear MPC for Personalized Autonomous Driving
Human-like autonomous driving controllers have the potential to enhance passenger perception of autonomous vehicles. This paper proposes DriViDOC: a model for Driving from Vision through Differentiable Optimal Control, and its application to learn personalized autonomous driving controllers from human demonstrations. DriViDOC combines the automatic inference of relevant features from camera frames with the properties of nonlinear model predictive control (NMPC), such as constraint satisfaction. Our approach leverages the differentiability of parametric NMPC, allowing for end-to-end learning of the driving model from images to control. The model is trained on an offline dataset comprising various driving styles collected on a motion-base driving simulator. During online testing, the model demonstrates successful imitation of different driving styles, and the interpreted NMPC parameters provide insights into the achievement of specific driving behaviors. Our experimental results show that DriViDOC outperforms other methods involving NMPC and neural networks, exhibiting an average improvement of 20% in imitation scores.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Accompanying video available at: https://youtu.be/WxWPuAtJ08E
☆ Subequivariant Reinforcement Learning Framework for Coordinated Motion Control
Effective coordination is crucial for motion control with reinforcement learning, especially as the complexity of agents and their motions increases. However, many existing methods struggle to account for the intricate dependencies between joints. We introduce CoordiGraph, a novel architecture that leverages subequivariant principles from physics to enhance coordination of motion control with reinforcement learning. This method embeds the principles of equivariance as inherent patterns in the learning process under gravity influence, which aids in modeling the nuanced relationships between joints vital for motion control. Through extensive experimentation with sophisticated agents in diverse environments, we highlight the merits of our approach. Compared to current leading methods, CoordiGraph notably enhances generalization and sample efficiency.
comment: 7 pages, 7 figures, 2024 IEEE International Conference on Robotics and Automation
☆ Automated Feature Selection for Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) is an imitation learning approach to learning reward functions from expert demonstrations. Its use avoids the difficult and tedious procedure of manual reward specification while retaining the generalization power of reinforcement learning. In IRL, the reward is usually represented as a linear combination of features. In continuous state spaces, the state variables alone are not sufficiently rich to be used as features, but which features are good is not known in general. To address this issue, we propose a method that employs polynomial basis functions to form a candidate set of features, which are shown to allow the matching of statistical moments of state distributions. Feature selection is then performed for the candidates by leveraging the correlation between trajectory probabilities and feature expectations. We demonstrate the approach's effectiveness by recovering reward functions that capture expert policies across non-linear control tasks of increasing complexity. Code, data, and videos are available at https://sites.google.com/view/feature4irl.
comment: 7 pages, 4 figures
☆ A Twin Delayed Deep Deterministic Policy Gradient Algorithm for Autonomous Ground Vehicle Navigation via Digital Twin Perception Awareness
Autonomous ground vehicle (UGV) navigation has the potential to revolutionize the transportation system by increasing accessibility to disabled people, ensure safety and convenience of use. However, UGV requires extensive and efficient testing and evaluation to ensure its acceptance for public use. This testing are mostly done in a simulator which result to sim2real transfer gap. In this paper, we propose a digital twin perception awareness approach for the control of robot navigation without prior creation of the virtual environment (VT) environment state. To achieve this, we develop a twin delayed deep deterministic policy gradient (TD3) algorithm that ensures collision avoidance and goal-based path planning. We demonstrate the performance of our approach on different environment dynamics. We show that our approach is capable of efficiently avoiding collision with obstacles and navigating to its desired destination, while at the same time safely avoids obstacles using the information received from the LIDAR sensor mounted on the robot. Our approach bridges the gap between sim-to-real transfer and contributes to the adoption of UGVs in real world. We validate our approach in simulation and a real-world application in an office space.
comment: 8 pages, 7 figures
☆ Rethinking 6-Dof Grasp Detection: A Flexible Framework for High-Quality Grasping
Robotic grasping is a primitive skill for complex tasks and is fundamental to intelligence. For general 6-Dof grasping, most previous methods directly extract scene-level semantic or geometric information, while few of them consider the suitability for various downstream applications, such as target-oriented grasping. Addressing this issue, we rethink 6-Dof grasp detection from a grasp-centric view and propose a versatile grasp framework capable of handling both scene-level and target-oriented grasping. Our framework, FlexLoG, is composed of a Flexible Guidance Module and a Local Grasp Model. Specifically, the Flexible Guidance Module is compatible with both global (e.g., grasp heatmap) and local (e.g., visual grounding) guidance, enabling the generation of high-quality grasps across various tasks. The Local Grasp Model focuses on object-agnostic regional points and predicts grasps locally and intently. Experiment results reveal that our framework achieves over 18% and 23% improvement on unseen splits of the GraspNet-1Billion Dataset. Furthermore, real-world robotic tests in three distinct settings yield a 95% success rate.
comment: 8 pages, 8 figures
☆ Linear Quadratic Guidance Law for Joint Motion Planning of a Pursuer-Turret Assembly
This paper presents joint motion planning of a vehicle with an attached rotating turret. The turret has a limited range as well as the field of view. The objective is capture a maneuvering target such that at the terminal time it is withing the field-of-view and range limits. Catering to it, we present a minimum effort guidance law that commensurate for the turn rate abilities of the vehicle and the turret. The guidance law is obtained using linearization about the collision triangle and admits an analytical solution. Simulation results are presented to exemplify the cooperation between the turret and the vehicle.
☆ Boundary-Aware Value Function Generation for Safe Stochastic Motion Planning
Navigation safety is critical for many autonomous systems such as self-driving vehicles in an urban environment. It requires an explicit consideration of boundary constraints that describe the borders of any infeasible, non-navigable, or unsafe regions. We propose a principled boundary-aware safe stochastic planning framework with promising results. Our method generates a value function that can strictly distinguish the state values between free (safe) and non-navigable (boundary) spaces in the continuous state, naturally leading to a safe boundary-aware policy. At the core of our solution lies a seamless integration of finite elements and kernel-based functions, where the finite elements allow us to characterize safety-critical states' borders accurately, and the kernel-based function speeds up computation for the non-safety-critical states. The proposed method was evaluated through extensive simulations and demonstrated safe navigation behaviors in mobile navigation tasks. Additionally, we demonstrate that our approach can maneuver safely and efficiently in cluttered real-world environments using a ground vehicle with strong external disturbances, such as navigating on a slippery floor and against external human intervention.
comment: Accepted by International Journal of Robotics Research
☆ SRLM: Human-in-Loop Interactive Social Robot Navigation with Large Language Model and Deep Reinforcement Learning
An interactive social robotic assistant must provide services in complex and crowded spaces while adapting its behavior based on real-time human language commands or feedback. In this paper, we propose a novel hybrid approach called Social Robot Planner (SRLM), which integrates Large Language Models (LLM) and Deep Reinforcement Learning (DRL) to navigate through human-filled public spaces and provide multiple social services. SRLM infers global planning from human-in-loop commands in real-time, and encodes social information into a LLM-based large navigation model (LNM) for low-level motion execution. Moreover, a DRL-based planner is designed to maintain benchmarking performance, which is blended with LNM by a large feedback model (LFM) to address the instability of current text and LLM-driven LNM. Finally, SRLM demonstrates outstanding performance in extensive experiments. More details about this work are available at: https://sites.google.com/view/navi-srlm
☆ CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments
We present ConVOI, a novel method for autonomous robot navigation in real-world indoor and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two ways: first, we leverage their zero-shot image classification capability to identify the context or scenario (e.g., indoor corridor, outdoor terrain, crosswalk, etc) of the robot's surroundings, and formulate context-based navigation behaviors as simple text prompts (e.g. ``stay on the pavement"). Second, we utilize their state-of-the-art semantic understanding and logical reasoning capabilities to compute a suitable trajectory given the identified context. To this end, we propose a novel multi-modal visual marking approach to annotate the obstacle-free regions in the RGB image used as input to the VLM with numbers, by correlating it with a local occupancy map of the environment. The marked numbers ground image locations in the real-world, direct the VLM's attention solely to navigable locations, and elucidate the spatial relationships between them and terrains depicted in the image to the VLM. Next, we query the VLM to select numbers on the marked image that satisfy the context-based behavior text prompt, and construct a reference path using the selected numbers. Finally, we propose a method to extrapolate the reference trajectory when the robot's environmental context has not changed to prevent unnecessary VLM queries. We use the reference trajectory to guide a motion planner, and demonstrate that it leads to human-like behaviors (e.g. not cutting through a group of people, using crosswalks, etc.) in various real-world indoor and outdoor scenarios.
comment: 9 pages, 4 figures
☆ Global Games with Negative Feedback for Autonomous Colony Maintenance using Robot Teams
In this article we address the colony maintenance problem, where a team of robots are tasked with continuously maintaining the energy supply of an autonomous colony. We model this as a global game, where robots measure the energy level of a central nest to determine whether or not to forage for energy sources. We design a mechanism that avoids the trivial equilibrium where all robots always forage. Furthermore, we demonstrate that when the game is played iteratively a negative feedback term stabilizes the number of foraging robots at a non-trivial Nash equilibrium. We compare our approach qualitatively to existing global games, where a positive positive feedback term admits threshold-based decision making, and encourages many robots to forage simultaneously. We discuss how positive feedback can lead to a cascading failure in the presence of a human who recruits robots for external tasks, and we demonstrate the performance of our approach in simulation.
comment: 6 pages, 5 figures
☆ Autonomous Driving With Perception Uncertainties: Deep-Ensemble Based Adaptive Cruise Control
Autonomous driving depends on perception systems to understand the environment and to inform downstream decision-making. While advanced perception systems utilizing black-box Deep Neural Networks (DNNs) demonstrate human-like comprehension, their unpredictable behavior and lack of interpretability may hinder their deployment in safety critical scenarios. In this paper, we develop an Ensemble of DNN regressors (Deep Ensemble) that generates predictions with quantification of prediction uncertainties. In the scenario of Adaptive Cruise Control (ACC), we employ the Deep Ensemble to estimate distance headway to the lead vehicle from RGB images and enable the downstream controller to account for the estimation uncertainty. We develop an adaptive cruise controller that utilizes Stochastic Model Predictive Control (MPC) with chance constraints to provide a probabilistic safety guarantee. We evaluate our ACC algorithm using a high-fidelity traffic simulator and a real-world traffic dataset and demonstrate the ability of the proposed approach to effect speed tracking and car following while maintaining a safe distance headway. The out-of-distribution scenarios are also examined.
☆ Music to Dance as Language Translation using Sequence Models
Synthesising appropriate choreographies from music remains an open problem. We introduce MDLT, a novel approach that frames the choreography generation problem as a translation task. Our method leverages an existing data set to learn to translate sequences of audio into corresponding dance poses. We present two variants of MDLT: one utilising the Transformer architecture and the other employing the Mamba architecture. We train our method on AIST++ and PhantomDance data sets to teach a robotic arm to dance, but our method can be applied to a full humanoid robot. Evaluation metrics, including Average Joint Error and Frechet Inception Distance, consistently demonstrate that, when given a piece of music, MDLT excels at producing realistic and high-quality choreography. The code can be found at github.com/meowatthemoon/MDLT.
♻ ☆ Gaussian-SLAM: Photo-realistic Dense SLAM with Gaussian Splatting
We present a dense simultaneous localization and mapping (SLAM) method that uses 3D Gaussians as a scene representation. Our approach enables interactive-time reconstruction and photo-realistic rendering from real-world single-camera RGBD videos. To this end, we propose a novel effective strategy for seeding new Gaussians for newly explored areas and their effective online optimization that is independent of the scene size and thus scalable to larger scenes. This is achieved by organizing the scene into sub-maps which are independently optimized and do not need to be kept in memory. We further accomplish frame-to-model camera tracking by minimizing photometric and geometric losses between the input and rendered frames. The Gaussian representation allows for high-quality photo-realistic real-time rendering of real-world scenes. Evaluation on synthetic and real-world datasets demonstrates competitive or superior performance in mapping, tracking, and rendering compared to existing neural dense SLAM methods.
♻ ☆ A Convex Formulation of Frictional Contact for the Material Point Method and Rigid Bodies
In this paper, we introduce a novel convex formulation that seamlessly integrates the Material Point Method (MPM) with articulated rigid body dynamics in frictional contact scenarios. We extend the linear corotational hyperelastic model into the realm of elastoplasticity and include an efficient return mapping algorithm. This approach is particularly effective for MPM simulations involving significant deformation and topology changes, while preserving the convexity of the optimization problem. Our method ensures global convergence, enabling the use of large simulation time steps without compromising robustness. We have validated our approach through rigorous testing and performance evaluations, highlighting its superior capabilities in managing complex simulations relevant to robotics. Compared to previous MPM based robotic simulators, our method significantly improves the stability of contact resolution -- a critical factor in robot manipulation tasks. We make our method available in the open-source robotics toolkit, Drake.
comment: The supplemental video is available at https://youtu.be/5jrQtF5D0DA
♻ ☆ Learning High-level Semantic-Relational Concepts for SLAM
Recent works on SLAM extend their pose graphs with higher-level semantic concepts like Rooms exploiting relationships between them, to provide, not only a richer representation of the situation/environment but also to improve the accuracy of its estimation. Concretely, our previous work, Situational Graphs (S-Graphs+), a pioneer in jointly leveraging semantic relationships in the factor optimization process, relies on semantic entities such as Planes and Rooms, whose relationship is mathematically defined. Nevertheless, there is no unique approach to finding all the hidden patterns in lower-level factor-graphs that correspond to high-level concepts of different natures. It is currently tackled with ad-hoc algorithms, which limits its graph expressiveness. To overcome this limitation, in this work, we propose an algorithm based on Graph Neural Networks for learning high-level semantic-relational concepts that can be inferred from the low-level factor graph. Given a set of mapped Planes our algorithm is capable of inferring Room entities relating to the Planes. Additionally, to demonstrate the versatility of our method, our algorithm can infer an additional semantic-relational concept, i.e. Wall, and its relationship with its Planes. We validate our method in both simulated and real datasets demonstrating improved performance over two baseline approaches. Furthermore, we integrate our method into the S-Graphs+ algorithm providing improved pose and map accuracy compared to the baseline while further enhancing the scene representation.
♻ ☆ LaMI: Large Language Models for Multi-Modal Human-Robot Interaction
This paper presents an innovative large language model (LLM)-based robotic system for enhancing multi-modal human-robot interaction (HRI). Traditional HRI systems relied on complex designs for intent estimation, reasoning, and behavior generation, which were resource-intensive. In contrast, our system empowers researchers and practitioners to regulate robot behavior through three key aspects: providing high-level linguistic guidance, creating "atomic actions" and expressions the robot can use, and offering a set of examples. Implemented on a physical robot, it demonstrates proficiency in adapting to multi-modal inputs and determining the appropriate manner of action to assist humans with its arms, following researchers' defined guidelines. Simultaneously, it coordinates the robot's lid, neck, and ear movements with speech output to produce dynamic, multi-modal expressions. This showcases the system's potential to revolutionize HRI by shifting from conventional, manual state-and-flow design methods to an intuitive, guidance-based, and example-driven approach. Supplementary material can be found at https://hri-eu.github.io/Lami/
comment: 10 pages, 6 figures
♻ ☆ Bi-KVIL: Keypoints-based Visual Imitation Learning of Bimanual Manipulation Tasks
Visual imitation learning has achieved impressive progress in learning unimanual manipulation tasks from a small set of visual observations, thanks to the latest advances in computer vision. However, learning bimanual coordination strategies and complex object relations from bimanual visual demonstrations, as well as generalizing them to categorical objects in novel cluttered scenes remain unsolved challenges. In this paper, we extend our previous work on keypoints-based visual imitation learning (\mbox{K-VIL})~\cite{gao_kvil_2023} to bimanual manipulation tasks. The proposed Bi-KVIL jointly extracts so-called \emph{Hybrid Master-Slave Relationships} (HMSR) among objects and hands, bimanual coordination strategies, and sub-symbolic task representations. Our bimanual task representation is object-centric, embodiment-independent, and viewpoint-invariant, thus generalizing well to categorical objects in novel scenes. We evaluate our approach in various real-world applications, showcasing its ability to learn fine-grained bimanual manipulation tasks from a small number of human demonstration videos. Videos and source code are available at https://sites.google.com/view/bi-kvil.
♻ ☆ Event-based Simultaneous Localization and Mapping: A Comprehensive Survey
In recent decades, visual simultaneous localization and mapping (vSLAM) has gained significant interest in both academia and industry. It estimates camera motion and reconstructs the environment concurrently using visual sensors on a moving robot. However, conventional cameras are limited by hardware, including motion blur and low dynamic range, which can negatively impact performance in challenging scenarios like high-speed motion and high dynamic range illumination. Recent studies have demonstrated that event cameras, a new type of bio-inspired visual sensor, offer advantages such as high temporal resolution, dynamic range, low power consumption, and low latency. This paper presents a timely and comprehensive review of event-based vSLAM algorithms that exploit the benefits of asynchronous and irregular event streams for localization and mapping tasks. The review covers the working principle of event cameras and various event representations for preprocessing event data. It also categorizes event-based vSLAM methods into four main categories: feature-based, direct, motion-compensation, and deep learning methods, with detailed discussions and practical guidance for each approach. Furthermore, the paper evaluates the state-of-the-art methods on various benchmarks, highlighting current challenges and future opportunities in this emerging research area. A public repository will be maintained to keep track of the rapid developments in this field at {\url{https://github.com/kun150kun/ESLAM-survey}}.
♻ ☆ Robust Direct Data-Driven Control for Probabilistic Systems
We propose a data-driven control method for systems with aleatoric uncertainty, for example, robot fleets with variations between agents. Our method leverages shared trajectory data to increase the robustness of the designed controller and thus facilitate transfer to new variations without the need for prior parameter and uncertainty estimations. In contrast to existing work on experience transfer for performance, our approach focuses on robustness and uses data collected from multiple realizations to guarantee generalization to unseen ones. Our method is based on scenario optimization combined with recent formulations for direct data-driven control. We derive lower bounds on the amount of data required to achieve quadratic stability for probabilistic systems with aleatoric uncertainty and demonstrate the benefits of our data-driven method through a numerical example. We find that the learned controllers generalize well to high variations in the dynamics even when based on only a few short open-loop trajectories. Robust experience transfer enables the design of safe and robust controllers that work out of the box without any additional learning during deployment.
♻ ☆ A Wind-Aware Path Planning Method for UAV-Asisted Bridge Inspection
In response to the gap in considering wind conditions in the bridge inspection using unmanned aerial vehicle (UAV) , this paper proposes a path planning method for UAVs that takes into account the influence of wind, based on the simulated annealing algorithm. The algorithm considers the wind factors, including the influence of different wind speeds and directions at the same time on the path planning of the UAV. Firstly, An environment model is constructed specifically for UAV bridge inspection, taking into account the various objective functions and constraint conditions of UAVs. A more sophisticated and precise mathematical model is then developed based on this environmental model to enable efficient and effective UAV path planning. Secondly, the bridge separation planning model is applied in a novel way, and a series of parameters are simulated, including the adjustment of the initial temperature value. The experimental results demonstrate that, compared with traditional local search algorithms, the proposed method achieves a cost reduction of 30.05\% and significantly improves effectiveness. Compared to path planning methods that do not consider wind factors, the proposed approach yields more realistic and practical results for UAV applications, as demonstrated by its improved effectiveness in simulations. These findings highlight the value of our method in facilitating more accurate and efficient UAV path planning in wind-prone environments.
comment: After carefully analysis, there is a bit design flaws in Algorithm 1. The experimental work of the paper is not comprehensive,which lacks an evaluation of the algorithm's running time
♻ ☆ Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation
As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to navigate to a specified object depicted by a goal image in an unexplored environment. The main challenge of this task lies in identifying the target object from different viewpoints while rejecting similar distractors. Existing ImageGoal Navigation methods usually adopt the simple Exploration-Exploitation framework and ignore the identification of specific instance during navigation. In this work, we propose to imitate the human behaviour of ``getting closer to confirm" when distinguishing objects from a distance. Specifically, we design a new modular navigation framework named Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level image goal navigation. Our method allows for active switching among the exploration, verification, and exploitation actions, thereby facilitating the agent in making reasonable decisions under different situations. On the challenging HabitatMatterport 3D semantic (HM3D-SEM) dataset, our method surpasses previous state-of-the-art work, with a classical segmentation model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 0.561 success)
♻ ☆ A Survey on Global LiDAR Localization: Challenges, Advances and Open Problems
Knowledge about the own pose is key for all mobile robot applications. Thus pose estimation is part of the core functionalities of mobile robots. Over the last two decades, LiDAR scanners have become the standard sensor for robot localization and mapping. This article aims to provide an overview of recent progress and advancements in LiDAR-based global localization. We begin by formulating the problem and exploring the application scope. We then present a review of the methodology, including recent advancements in several topics, such as maps, descriptor extraction, and cross-robot localization. The contents of the article are organized under three themes. The first theme concerns the combination of global place retrieval and local pose estimation. The second theme is upgrading single-shot measurements to sequential ones for sequential global localization. Finally, the third theme focuses on extending single-robot global localization to cross-robot localization in multi-robot systems. We conclude the survey with a discussion of open challenges and promising directions in global LiDAR localization. To our best knowledge, this is the first comprehensive survey on global LiDAR localization for mobile robots.
comment: Publishe on International Journal of Computer Vision (IJCV)
♻ ☆ Kinematic Modularity of Elementary Dynamic Actions
In this paper, a kinematically modular approach to robot control is presented. The method involves structures called Elementary Dynamic Actions and a network model combining these elements. With this control framework, a rich repertoire of movements can be generated by combination of basic modules. The problems of solving inverse kinematics, managing kinematic singularity and kinematic redundancy are avoided. The modular approach is robust against contact and physical interaction, which makes it particularly effective for contact-rich manipulation. Each kinematic module can be learned by Imitation Learning, thereby resulting in a modular learning strategy for robot control. The theoretical foundations and their real robot implementation are presented. Using a KUKA LBR iiwa14 robot, three tasks were considered: (1) generating a sequence of discrete movements, (2) generating a combination of discrete and rhythmic movements, and (3) a drawing and erasing task. The results obtained indicate that this modular approach has the potential to simplify the generation of a diverse range of robot actions.
comment: 8 pages, 4 figures
♻ ☆ Learning Hierarchical Control For Multi-Agent Capacity-Constrained Systems
This paper introduces a novel data-driven hierarchical control scheme for managing a fleet of nonlinear, capacity-constrained autonomous agents in an iterative environment. We propose a control framework consisting of a high-level dynamic task assignment and routing layer and low-level motion planning and tracking layer. Each layer of the control hierarchy uses a data-driven Model Predictive Control (MPC) policy, maintaining bounded computational complexity at each calculation of a new task assignment or actuation input. We utilize collected data to iteratively refine estimates of agent capacity usage, and update MPC policy parameters accordingly. Our approach leverages tools from iterative learning control to integrate learning at both levels of the hierarchy, and coordinates learning between levels in order to maintain closed-loop feasibility and performance improvement of the connected architecture.
♻ ☆ iSLAM: Imperative SLAM
Simultaneous Localization and Mapping (SLAM) stands as one of the critical challenges in robot navigation. A SLAM system often consists of a front-end component for motion estimation and a back-end system for eliminating estimation drifts. Recent advancements suggest that data-driven methods are highly effective for front-end tasks, while geometry-based methods continue to be essential in the back-end processes. However, such a decoupled paradigm between the data-driven front-end and geometry-based back-end can lead to sub-optimal performance, consequently reducing the system's capabilities and generalization potential. To solve this problem, we proposed a novel self-supervised imperative learning framework, named imperative SLAM (iSLAM), which fosters reciprocal correction between the front-end and back-end, thus enhancing performance without necessitating any external supervision. Specifically, we formulate the SLAM problem as a bilevel optimization so that the front-end and back-end are bidirectionally connected. As a result, the front-end model can learn global geometric knowledge obtained through pose graph optimization by back-propagating the residuals from the back-end component. We showcase the effectiveness of this new framework through an application of stereo-inertial SLAM. The experiments show that the iSLAM training strategy achieves an accuracy improvement of 22% on average over a baseline model. To the best of our knowledge, iSLAM is the first SLAM system showing that the front-end and back-end components can mutually correct each other in a self-supervised manner.
comment: The paper has been accepted by IEEE Robotics and Automation Letters (RA-L)
♻ ☆ AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers
For effective human-robot interaction, robots need to understand, plan, and execute complex, long-horizon tasks described by natural language. Recent advances in large language models (LLMs) have shown promise for translating natural language into robot action sequences for complex tasks. However, existing approaches either translate the natural language directly into robot trajectories or factor the inference process by decomposing language into task sub-goals and relying on a motion planner to execute each sub-goal. When complex environmental and temporal constraints are involved, inference over planning tasks must be performed jointly with motion plans using traditional task-and-motion planning (TAMP) algorithms, making factorization into subgoals untenable. Rather than using LLMs to directly plan task sub-goals, we instead perform few-shot translation from natural language task descriptions to an intermediate task representation that can then be consumed by a TAMP algorithm to jointly solve the task and motion plan. To improve translation, we automatically detect and correct both syntactic and semantic errors via autoregressive re-prompting, resulting in significant improvements in task completion. We show that our approach outperforms several methods using LLMs as planners in complex task domains. See our project website https://yongchao98.github.io/MIT-REALM-AutoTAMP/ for prompts, videos, and code.
comment: 8 pages, 4 figures
♻ ☆ Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?
A flurry of recent work has demonstrated that pre-trained large language models (LLMs) can be effective task planners for a variety of single-robot tasks. The planning performance of LLMs is significantly improved via prompting techniques, such as in-context learning or re-prompting with state feedback, placing new importance on the token budget for the context window. An under-explored but natural next direction is to investigate LLMs as multi-robot task planners. However, long-horizon, heterogeneous multi-robot planning introduces new challenges of coordination while also pushing up against the limits of context window length. It is therefore critical to find token-efficient LLM planning frameworks that are also able to reason about the complexities of multi-robot coordination. In this work, we compare the task success rate and token efficiency of four multi-agent communication frameworks (centralized, decentralized, and two hybrid) as applied to four coordination-dependent multi-agent 2D task scenarios for increasing numbers of agents. We find that a hybrid framework achieves better task success rates across all four tasks and scales better to more agents. We further demonstrate the hybrid frameworks in 3D simulations where the vision-to-text problem and dynamical errors are considered. See our project website https://yongchao98.github.io/MIT-REALM-Multi-Robot/ for prompts, videos, and code.
comment: 7 pages, 8 figures
♻ ☆ Learning Complex Motion Plans using Neural ODEs with Safety and Stability Guarantees ICRA 2024
We propose a Dynamical System (DS) approach to learn complex, possibly periodic motion plans from kinesthetic demonstrations using Neural Ordinary Differential Equations (NODE). To ensure reactivity and robustness to disturbances, we propose a novel approach that selects a target point at each time step for the robot to follow, by combining tools from control theory and the target trajectory generated by the learned NODE. A correction term to the NODE model is computed online by solving a quadratic program that guarantees stability and safety using control Lyapunov functions and control barrier functions, respectively. Our approach outperforms baseline DS learning techniques on the LASA handwriting dataset and complex periodic trajectories. It is also validated on the Franka Emika robot arm to produce stable motions for wiping and stirring tasks that do not have a single attractor, while being robust to perturbations and safe around humans and obstacles.
comment: accepted to ICRA 2024
♻ ☆ A Traffic Management Framework for On-Demand Urban Air Mobility Systems
Urban Air Mobility (UAM) offers a solution to current traffic congestion by providing on-demand air mobility in urban areas. Effective traffic management is crucial for efficient operation of UAM systems, especially for high-demand scenarios. In this paper, we present a centralized traffic management framework for on-demand UAM systems. Specifically, we provide a scheduling policy, called VertiSync, which schedules the aircraft for either servicing trip requests or rebalancing in the system subject to aircraft safety margins and energy requirements. We characterize the system-level throughput of VertiSync, which determines the demand threshold at which passenger waiting times transition from being stabilized to being increasing over time. We show that the proposed policy is able to maximize throughput for sufficiently large fleet sizes. We demonstrate the performance of VertiSync through a case study for the city of Los Angeles, and show that it significantly reduces passenger waiting times compared to a first-come first-serve scheduling policy.
comment: 9 pages, 6 figures
♻ ☆ Kinematics-aware Trajectory Generation and Prediction with Latent Stochastic Differential Modeling
Trajectory generation and trajectory prediction are two critical tasks in autonomous driving, which generate various trajectories for testing during development and predict the trajectories of surrounding vehicles during operation, respectively. In recent years, emerging data-driven deep learning-based methods have shown great promise for these two tasks in learning various traffic scenarios and improving average performance without assuming physical models. However, it remains a challenging problem for these methods to ensure that the generated/predicted trajectories are physically realistic. This challenge arises because learning-based approaches often function as opaque black boxes and do not adhere to physical laws. Conversely, existing model-based methods provide physically feasible results but are constrained by predefined model structures, limiting their capabilities to address complex scenarios. To address the limitations of these two types of approaches, we propose a new method that integrates kinematic knowledge into neural stochastic differential equations (SDE) and designs a variational autoencoder based on this latent kinematics-aware SDE (LK-SDE) to generate vehicle motions. Experimental results demonstrate that our method significantly outperforms both model-based and learning-based baselines in producing physically realistic and precisely controllable vehicle trajectories. Additionally, it performs well in predicting unobservable physical variables in the latent space.
comment: 8 pages, conference paper in motion generation
Systems and Control
☆ Control Designs for Critical-Continegency Responsible Grid-Following Inverters and Seamless Transitions To and From Grid-Forming Modes
This article introduces two control frameworks: one for Grid-Following (GFL) inverters aiding Grid-Forming (GFM) inverters in voltage regulation during large contingency events and optimizing power transactions under normal conditions; and another for seamless transitions between grid-tied and grid-isolated setups, managing voltage transient characteristics. In microgrids, GFM inverters regulate voltage, while GFL inverters handle power transactions. The proposed GFL control detects abrupt load/generation changes, adjusting power transactions using local storage to support GFM inverters during contingencies. Additionally, a transition control ensures smooth GFL-GFM shifts, reducing power and voltage fluctuations. Simulation results validate improved voltage regulation during contingencies and enhanced power tracking during slow changes, alongside minimized transient overshoot.
comment: 6 pages, 6 figures, submitted and accepted for 2024 American Control Conference (ACC)
☆ Cascading Blackout Severity Prediction with Statistically-Augmented Graph Neural Networks SC
Higher variability in grid conditions, resulting from growing renewable penetration and increased incidence of extreme weather events, has increased the difficulty of screening for scenarios that may lead to catastrophic cascading failures. Traditional power-flow-based tools for assessing cascading blackout risk are too slow to properly explore the space of possible failures and load/generation patterns. We add to the growing literature of faster graph-neural-network (GNN)-based techniques, developing two novel techniques for the estimation of blackout magnitude from initial grid conditions. First we propose several methods for employing an initial classification step to filter out safe "non blackout" scenarios prior to magnitude estimation. Second, using insights from the statistical properties of cascading blackouts, we propose a method for facilitating non-local message passing in our GNN models. We validate these two approaches on a large simulated dataset, and show the potential of both to increase blackout size estimation performance.
comment: Accepted to Power Systems Computation Conference (PSCC) 2024
☆ SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series
Transformers have widely adopted attention networks for sequence mixing and MLPs for channel mixing, playing a pivotal role in achieving breakthroughs across domains. However, recent literature highlights issues with attention networks, including low inductive bias and quadratic complexity concerning input sequence length. State Space Models (SSMs) like S4 and others (Hippo, Global Convolutions, liquid S4, LRU, Mega, and Mamba), have emerged to address the above issues to help handle longer sequence lengths. Mamba, while being the state-of-the-art SSM, has a stability issue when scaled to large networks for computer vision datasets. We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations and uses the Mamba block for sequence modeling. Extensive performance studies across image and time-series benchmarks demonstrate that SiMBA outperforms existing SSMs, bridging the performance gap with state-of-the-art transformers. Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet and transfer learning benchmarks such as Stanford Car and Flower as well as task learning benchmarks as well as seven time series benchmark datasets. The project page is available on this website ~\url{https://github.com/badripatro/Simba}.
☆ Optimal Exploration Strategy for Regret Minimization in Unconstrained Scalar Optimization Problems
We study the problem of determining the optimal exploration strategy in an unconstrained scalar optimization problem depending on an unknown parameter to be learned from online collected noisy data. An optimal trade-off between exploration and exploitation is crucial for effective optimization under uncertainties, and to achieve this we consider a cumulative regret minimization approach over a finite horizon, with each time instant in the horizon characterized by a stochastic exploration signal, whose variance has to be designed. In this setting, under an idealized assumption on an appropriately defined information function associated with the excitation, we are able to show that the optimal exploration strategy is either to use no exploration at all (called lazy exploration) or adding an exploration excitation only at the first time instant of the horizon (called immediate exploration). A quadratic numerical example is used to illustrate the results.
comment: Preprint submitted to IEEE Conference on Decision and Control (CDC 2024)
☆ Optimal Data-Driven Prediction and Predictive Control using Signal Matrix Models
Data-driven control uses a past signal trajectory to characterise the input-output behaviour of a system. Willems' lemma provides a data-based prediction model allowing a control designer to bypass the step of identifying a state-space or transfer function model. This paper provides a more parsimonious formulation of Willems' lemma that separates the model into initial condition matching and predictive control design parts. This avoids the need for regularisers in the predictive control problem that are found in other data-driven predictive control methods. It also gives a closed form expression for the optimal (minimum variance) unbiased predictor of the future output trajectory and applies it for predictive control. Simulation comparisons illustrate very good control performance.
comment: 7 pages, 3 figures. Submitted to IEEE Control Systems Society Letters and the 2024 Conference on Decision and Control
☆ Event-Triggered State Estimation Through Confidence Level
This paper considers the state estimation problem for discrete-time linear systems under event-triggered scheme. In order to improve performance, a novel event-triggered scheme based on confidence level is proposed using the chi-square distribution and mild regularity assumption. In terms of the novel event-triggered scheme, a minimum mean squared error (MMSE) state estimator is proposed using some results presented in this paper. Two algorithms for communication rate estimation of the proposed MMSE state estimator are developed where the first algorithm is based on information with one-step delay, and the second algorithm is based on information with two-step delay. The performance and effectiveness of the proposed MMSE state estimator and the two communication rate estimation algorithms are illustrated using a target tracking scenario.
☆ Control contraction metrics on Lie groups
In this paper, we extend the control contraction metrics (CCM) approach, which was originally proposed for the universal tracking control of nonlinear systems, to those that evolves on Lie groups. Our idea is to view the manifold as a constrained set that is embedded in Euclidean space, and then propose the sufficient conditions for the existence of a CCM and the associated controller design. Notably, we demonstrate that the search for CCM on Lie groups can be reformulated as convex conditions. The results extend the applicability of the CCM approach and provide a framework for analyzing the behavior of control systems with Lie group structures.
☆ On moment relaxations for linear state feedback controller synthesis with non-convex quadratic costs and constraints
We present a simple and effective way to account for non-convex costs and constraints~in~state feedback synthesis, and an interpretation for the variables in which state feedback synthesis is typically convex. We achieve this by deriving the controller design using moment matrices of state and input. It turns out that this approach allows the consideration of non-convex constraints by relaxing them as expectation constraints, and that the variables in which state feedback synthesis is typically convexified can be identified with blocks of these moment matrices.
comment: Preprent to be submitted to IEEE Conference on Decision and Control
☆ Robust optimization for adversarial learning with finite sample complexity guarantees
Decision making and learning in the presence of uncertainty has attracted significant attention in view of the increasing need to achieve robust and reliable operations. In the case where uncertainty stems from the presence of adversarial attacks this need is becoming more prominent. In this paper we focus on linear and nonlinear classification problems and propose a novel adversarial training method for robust classifiers, inspired by Support Vector Machine (SVM) margins. We view robustness under a data driven lens, and derive finite sample complexity bounds for both linear and non-linear classifiers in binary and multi-class scenarios. Notably, our bounds match natural classifiers' complexity. Our algorithm minimizes a worst-case surrogate loss using Linear Programming (LP) and Second Order Cone Programming (SOCP) for linear and non-linear models. Numerical experiments on the benchmark MNIST and CIFAR10 datasets show our approach's comparable performance to state-of-the-art methods, without needing adversarial examples during training. Our work offers a comprehensive framework for enhancing binary linear and non-linear classifier robustness, embedding robustness in learning under the presence of adversaries.
☆ Forecasting the load of Parcel Pickup Points using a Markov Jump Process
The growth of e-commerce has resulted in a surge in parcel deliveries, increasing transportation costs and pollution issues. Alternatives to home delivery have emerged, such as the delivery to so-called parcel pick-up points (PUPs), which eliminates delivery failure due to customers not being at home. Nevertheless, parcels reaching overloaded PUPs may need to be redirected to alternative PUPs, sometimes far from the chosen ones, which may generate customer dissatisfaction. Consequently, predicting the PUP load is critical for a PUP management company to infer the availability of PUPs for future orders and better balance parcel flows between PUPs. This paper proposes a new approach to forecasting the PUP load evolution using a Markov jump process that models the parcel life cycle. The latest known status of each parcel is considered to estimate its contribution to the future load of its target PUP. This approach can account for the variability of activity, the various parcel preparation delays by sellers, and the diversity of parcel carriers that may result in different delivery delays. Here, results are provided for predicting the load associated with parcels ordered from online retailers by customers (Business-to-Customer, B2C). The proposed approach is generic and can also be applied to other parcel flows to PUPs, such as second-hand products (Customer-to-Customer, C2C) sent via a PUP network.
☆ Pursuit-Evasion on a Sphere and When It Can Be Considered Flat
In classical works on a planar differential pursuit-evasion game with a faster pursuer, the intercept point resulting from the equilibrium strategies lies on the Apollonius circle. This property was exploited for the construction of the equilibrium strategies for two faster pursuers against one evader. Extensions for planar multiple-pursuer single-evader scenarios have been considered. We study a pursuit-evasion game on a sphere and the relation of the equilibrium intercept point to the Apollonius domain on the sphere. The domain is a generalization of the planar Apollonius circle set. We find a condition resulting in the intercept point belonging to the Apollonius domain, which is the characteristic of the planar game solution. Finally, we use this characteristic to discuss pursuit and evasion strategies in the context of two pursuers and a single slower evader on the sphere and illustrate it using numerical simulations.
comment: 8 Pages, 5 figures, To be submitted to 2024 Conference on Decision and Control in Milan, Italy
☆ Infrastructure-Assisted Collaborative Perception in Automated Valet Parking: A Safety Perspective
Environmental perception in Automated Valet Parking (AVP) has been a challenging task due to severe occlusions in parking garages. Although Collaborative Perception (CP) can be applied to broaden the field of view of connected vehicles, the limited bandwidth of vehicular communications restricts its application. In this work, we propose a BEV feature-based CP network architecture for infrastructure-assisted AVP systems. The model takes the roadside camera and LiDAR as optional inputs and adaptively fuses them with onboard sensors in a unified BEV representation. Autoencoder and downsampling are applied for channel-wise and spatial-wise dimension reduction, while sparsification and quantization further compress the feature map with little loss in data precision. Combining these techniques, the size of a BEV feature map is effectively compressed to fit in the feasible data rate of the NR-V2X network. With the synthetic AVP dataset, we observe that CP can effectively increase perception performance, especially for pedestrians. Moreover, the advantage of infrastructure-assisted CP is demonstrated in two typical safety-critical scenarios in the AVP setting, increasing the maximum safe cruising speed by up to 3m/s in both scenarios.
comment: 7 pages, 7 figures, 4 tables, accepted by IEEE VTC2024-Spring
☆ Hybrid integrator-gain system based integral resonant controllers for negative imaginary systems
We introduce a hybrid control system called a hybrid integrator-gain system (HIGS) based integral resonant controller (IRC) to stabilize negative imaginary (NI) systems. A HIGS-based IRC has a similar structure to an IRC, with the integrator replaced by a HIGS. We show that a HIGS-based IRC is an NI system. Also, for a SISO NI system with a minimal realization, we show there exists a HIGS-based IRC such that their closed-loop interconnection is asymptotically stable. Also, we propose a proportional-integral-double-integral resonant controller and a HIGS-based proportional-integral-double-integral resonant controller and show that both of them can be applied to asymptotically stabilize an NI system. An example is provided to illustrate the proposed results.
comment: 9 pages, 9 figures
☆ Collision Avoidance Safety Filter for an Autonomous E-Scooter using Ultrasonic Sensors
In this paper, we propose a collision avoidance safety filter for autonomous electric scooters to enable safe operation of such vehicles in pedestrian areas. In particular, we employ multiple low-cost ultrasonic sensors to detect a wide range of possible obstacles in front of the e-scooter. Based on possibly faulty distance measurements, we design a filter to mitigate measurement noise and missing values as well as a gain-scheduled controller to limit the velocity commanded to the e-scooter when required due to imminent collisions. The proposed controller structure is able to prevent collisions with unknown obstacles by deploying a reduced safe velocity ensuring a sufficiently large safety distance. The collision avoidance approach is designed such that it may be easily deployed in similar applications of general micromobility vehicles. The effectiveness of our proposed safety filter is demonstrated in real-world experiments.
☆ Set-membership target search and tracking within an unknown cluttered area using cooperating UAVs equipped with vision systems
This paper addresses the problem of target search and tracking using a fleet of cooperating UAVs evolving in some unknown region of interest containing an a priori unknown number of moving ground targets. Each drone is equipped with an embedded Computer Vision System (CVS), providing an image with labeled pixels and a depth map of the observed part of its environment. Moreover, a box containing the corresponding pixels in the image frame is available when a UAV identifies a target. Hypotheses regarding information provided by the pixel classification, depth map construction, and target identification algorithms are proposed to allow its exploitation by set-membership approaches. A set-membership target location estimator is developed using the information provided by the CVS. Each UAV evaluates sets guaranteed to contain the location of the identified targets and a set possibly containing the locations of targets still to be identified. Then, each UAV uses these sets to search and track targets cooperatively.
☆ Improved Long Short-Term Memory-based Wastewater Treatment Simulators for Deep Reinforcement Learning
Even though Deep Reinforcement Learning (DRL) showed outstanding results in the fields of Robotics and Games, it is still challenging to implement it in the optimization of industrial processes like wastewater treatment. One of the challenges is the lack of a simulation environment that will represent the actual plant as accurately as possible to train DRL policies. Stochasticity and non-linearity of wastewater treatment data lead to unstable and incorrect predictions of models over long time horizons. One possible reason for the models' incorrect simulation behavior can be related to the issue of compounding error, which is the accumulation of errors throughout the simulation. The compounding error occurs because the model utilizes its predictions as inputs at each time step. The error between the actual data and the prediction accumulates as the simulation continues. We implemented two methods to improve the trained models for wastewater treatment data, which resulted in more accurate simulators: 1- Using the model's prediction data as input in the training step as a tool of correction, and 2- Change in the loss function to consider the long-term predicted shape (dynamics). The experimental results showed that implementing these methods can improve the behavior of simulators in terms of Dynamic Time Warping throughout a year up to 98% compared to the base model. These improvements demonstrate significant promise in creating simulators for biological processes that do not need pre-existing knowledge of the process but instead depend exclusively on time series data obtained from the system.
☆ Direct and Indirect Hydrogen Storage: Dynamics and Interactions in the Transition to a Renewable Energy Based System for Europe
To move towards a low-carbon society by 2050, understanding the intricate dynamics of energy systems is critical. Our study examines these interactions through the lens of hydrogen storage, dividing it into 'direct' and 'indirect' hydrogen storage. Direct hydrogen storage involves electrolysis-produced hydrogen being stored before use, while indirect storage first transforms hydrogen into gas via the Sabatier process for later energy distribution. Firstly, we utilize the PyPSA-Eur-Sec-30-path model to capture the interactions within the energy system. The model is an hour-level, one node per country system that encompasses a range of energy transformation technologies, outlining a pathway for Europe to reduce carbon emissions by 95 percent by 2050 compared to 1990, with updates every 5 years. Subsequently, we employ both quantitative and qualitative approaches to thoroughly analyze these complex relationships. Our research indicates that during the European green transition, cross-country flow of electricity will play an important role in Europe's rapid decarbonization stage before the large-scale introduction of energy storage. Under the paper cost assumptions, fuel cells are not considered a viable option. This research further identifies the significant impact of natural resource variability on the local energy mix, highlighting indirect hydrogen storage as a common solution due to the better economic performance and actively fluctuation pattern. Specifically, indirect hydrogen storage will contribute at least 60 percent of hydrogen storage benefits, reaching 100 percent in Italy. Moreover, its fluctuation pattern will change with the local energy structure, which is a distinct difference with the unchanged pattern of direct hydrogen storage and battery storage.
☆ A Twin Delayed Deep Deterministic Policy Gradient Algorithm for Autonomous Ground Vehicle Navigation via Digital Twin Perception Awareness
Autonomous ground vehicle (UGV) navigation has the potential to revolutionize the transportation system by increasing accessibility to disabled people, ensure safety and convenience of use. However, UGV requires extensive and efficient testing and evaluation to ensure its acceptance for public use. This testing are mostly done in a simulator which result to sim2real transfer gap. In this paper, we propose a digital twin perception awareness approach for the control of robot navigation without prior creation of the virtual environment (VT) environment state. To achieve this, we develop a twin delayed deep deterministic policy gradient (TD3) algorithm that ensures collision avoidance and goal-based path planning. We demonstrate the performance of our approach on different environment dynamics. We show that our approach is capable of efficiently avoiding collision with obstacles and navigating to its desired destination, while at the same time safely avoids obstacles using the information received from the LIDAR sensor mounted on the robot. Our approach bridges the gap between sim-to-real transfer and contributes to the adoption of UGVs in real world. We validate our approach in simulation and a real-world application in an office space.
comment: 8 pages, 7 figures
☆ Implementation of Firm-Dispatchable Generation in South Africa
South Africa is currently facing a critical situation in its power generation landscape, which is plagued by frequent power outages and the need to move from fossil fuels to renewable energy sources. This period emphasizes the importance of having firm-dispatchable power to balance out the intermittent nature of wind and solar energy sources. The paper proposes to repurpose old coal-fired power plants to generate firm-dispatchable energy in line with the principles of a Just Transition. Eskom's coal plants are approaching the end of their economic life, and their declining energy availability factor is becoming a challenge in meeting the country's energy needs. The study suggests that a comprehensive strategy that integrates wind, solar, and firm-dispatchable power can be cost-effective and reliable compared to the traditional coal-based approach or the nuclear alternative. The study emphasizes the necessity of a 25-year plan that would invest in flexible and modular dispatchable generation. It also highlights the strategic location of this generating capacity, including repurposing decommissioned coal plant sites. The proposed model integrates private investment, adheres to established best practices, and emphasizes adaptability to changing demand dynamics. The study provides a roadmap for enabling firm-dispatchable capacity for South Africa's energy transition, emphasizing economic prudence, environmental sustainability, and alignment with the principles of the Just Transition program.
☆ On the Solution Uniqueness of Data-Driven Modeling of Flexible Loads
This letter first explores the solution uniqueness of the data-driven modeling of price-responsive flexible loads (PFL). The PFL on the demand side is critical in modern power systems. An accurate PFL model is fundamental for system operations. Yet, whether the PFL model can be uniquely and correctly identified from operational data remains unclear. To address this, we analyze the structural and practical identifiability of the PFL model, deriving the condition for the solution uniqueness. Based on this, we point out the implications for selecting physical models of PFL to enhance the identification results. Numerical results validate this work.
☆ Linear Quadratic Guidance Law for Joint Motion Planning of a Pursuer-Turret Assembly
This paper presents joint motion planning of a vehicle with an attached rotating turret. The turret has a limited range as well as the field of view. The objective is capture a maneuvering target such that at the terminal time it is withing the field-of-view and range limits. Catering to it, we present a minimum effort guidance law that commensurate for the turn rate abilities of the vehicle and the turret. The guidance law is obtained using linearization about the collision triangle and admits an analytical solution. Simulation results are presented to exemplify the cooperation between the turret and the vehicle.
☆ Real-time Safety Index Adaptation for Parameter-varying Systems via Determinant Gradient Ascend
Safety Index Synthesis (SIS) is critical for deriving safe control laws. Recent works propose to synthesize a safety index (SI) via nonlinear programming and derive a safe control law such that the system 1) achieves forward invariant (FI) with some safe set and 2) guarantees finite time convergence (FTC) to that safe set. However, real-world system dynamics can vary during run-time, making the control law infeasible and invalidating the initial SI. Since the full SIS nonlinear programming is computationally expensive, it is infeasible to re-synthesize the SI each time the dynamics are perturbed. To address that, this paper proposes an efficient approach to adapting the SI to varying system dynamics and maintaining the feasibility of the safe control law. The proposed method leverages determinant gradient ascend and derives a closed-form update to safety index parameters, enabling real-time adaptation performance. A numerical study validates the effectiveness of our approach.
comment: Accepted to American Control Conference (ACC) 2024
☆ Data-Driven Predictive Control with Adaptive Disturbance Attenuation for Constrained Systems
In this paper, we propose a novel data-driven predictive control approach for systems subject to time-domain constraints. The approach combines the strengths of H-infinity control for rejecting disturbances and MPC for handling constraints. In particular, the approach can dynamically adapt H-infinity disturbance attenuation performance depending on measured system state and forecasted disturbance level to satisfy constraints. We establish theoretical properties of the approach including robust guarantees of closed-loop stability, disturbance attenuation, constraint satisfaction under noisy data, as well as sufficient conditions for recursive feasibility, and illustrate the approach with a numerical example.
comment: 11 pages, 2 figures
☆ Structured stability analysis of networked systems with uncertain links
An input-output approach to stability analysis is explored for networked systems with uncertain link dynamics. The main result consists of a collection of integral quadratic constraints, which together imply robust stability of the uncertain networked system, under the assumption that stability is achieved with ideal links. The conditions are decentralized inasmuch as each involves only agent and uncertainty model parameters that are local to a corresponding link. This makes the main result, which imposes no restriction on network structure, suitable for the study of large-scale systems.
☆ Network Learning with Directional Sign Patterns
Complex systems can be effectively modeled via graphs that encode networked interactions, where relations between entities or nodes are often quantified by signed edge weights, e.g., promotion/inhibition in gene regulatory networks, or encoding political of friendship differences in social networks. However, it is often the case that only an aggregate consequence of such edge weights that characterize relations may be directly observable, as in protein expression of in gene regulatory networks. Thus, learning edge weights poses a significant challenge that is further exacerbated for intricate and large-scale networks. In this article, we address a model problem to determine the strength of sign-indefinite relations that explain marginal distributions that constitute our data. To this end, we develop a paradigm akin to that of the Schr\"odinger bridge problem and an efficient Sinkhorn type algorithm (more properly, Schr\"odinger-Fortet-Sinkhorn algorithm) that allows fast convergence to parameters that minimize a relative entropy/likelihood criterion between the sought signed adjacency matrix and a prior. The formalism that we present represents a novel generalization of the earlier Schr\"odinger formalism in that marginal computations may incorporate weights that model directionality in underlying relations, and further, that it can be extended to high-order networks -- the Schr\"odinger-Fortet-Sinkhorn algorithm that we derive is applicable all the same and allows geometric convergence to a sought sign-indefinite adjacency matrix or tensor, for high-order networks. We demonstrate our framework with synthetic and real-world examples.
☆ Optimisation of photodetectors design: comparison between Montecarlo and Genetic Algorithms
We present Montecarlo and Genetic Algorithm optimisations applied to the design of photodetectors based on a transimpedance amplifier and a photodiode. The circuit performance is evaluated with a merit function and the systematic search method is used as a reference. The design parameters are the feedback network components and the photodiode bias voltage. To evaluate the optimisations, we define the relative difference between its merit and the optimum merit obtained by the systematic search. In both algorithms, the relative difference decreases with the number of evaluations, following a power law. The power-law exponent for the Genetic Algorithm is larger than that of Montecarlo (0.74 vs. 0.50). We conclude that both algorithms are advantageous compared to the systematic search method, and that the Genetic Algorithm shows a better performance than Montecarlo.
comment: 10 pages, 14 figures
♻ ☆ Empowering Autonomous Driving with Large Language Models: A Safety Perspective ICLR2024
Autonomous Driving (AD) encounters significant safety hurdles in long-tail unforeseen driving scenarios, largely stemming from the non-interpretability and poor generalization of the deep neural networks within the AD system, particularly in out-of-distribution and uncertain data. To this end, this paper explores the integration of Large Language Models (LLMs) into AD systems, leveraging their robust common-sense knowledge and reasoning abilities. The proposed methodologies employ LLMs as intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning, for enhancing driving performance and safety. We present two key studies in a simulated environment: an adaptive LLM-conditioned Model Predictive Control (MPC) and an LLM-enabled interactive behavior planning scheme with a state machine. Demonstrating superior performance and safety metrics compared to state-of-the-art approaches, our approach shows the promising potential for using LLMs for autonomous vehicles.
comment: Accepted to LLMAgent workshop @ICLR2024
♻ ☆ A Benchmark for the Application of Distributed Control Techniques to the Electricity Network of the European Economic Area
The European Economic Area Electricity Network Benchmark (EEA-ENB) is a multi-area power system representing the European network of transmission systems for electricity to facilitate the application of distributed control techniques. In the EEA-ENB we consider the Load Frequency Control (LFC) problem in the presence of renewable energy sources (RESs), and energy storage systems (ESSs). RESs are known to cause instability in power networks due to their inertia-less and intermittent characteristics, while ESSs are introduced as a resource to mitigate the problem. In the EEA-ENB, particular attention is dedicated to Distributed Model Predictive Control (DMPC), whose application is often limited to small and homogeneous test cases due to the lack of standardized large-scale scenarios for testing, and due to the large computation time required to obtain a centralized MPC action for performance comparison with DMPC strategies under consideration. The second problem is exacerbated when the scale of the system grows. To address these challenges and to provide a real-world-based and control-independent benchmark, the EEA-ENB has been developed. The benchmark includes a centralized MPC strategy providing performance and computation time metrics to compare distributed control within a repeatable and realistic simulation environment.
comment: Updated reference list with reference and DOI to software sources to run the benchmark
♻ ☆ Novelty Detection in Reinforcement Learning with World Models
Reinforcement learning (RL) using world models has found significant recent successes. However, when a sudden change to world mechanics or properties occurs then agent performance and reliability can dramatically decline. We refer to the sudden change in visual properties or state transitions as novelties. Implementing novelty detection within generated world model frameworks is a crucial task for protecting the agent when deployed. In this paper, we propose straightforward bounding approaches to incorporate novelty detection into world model RL agents, by utilizing the misalignment of the world model's hallucinated states and the true observed states as an anomaly score. We provide effective approaches to detecting novelties in a distribution of transitions learned by an agent in a world model. Finally, we show the advantage of our work in a novel environment compared to traditional machine learning novelty detection methods as well as currently accepted RL focused novelty detection algorithms.
♻ ☆ Graph-structured tensor optimization for nonlinear density control and mean field games
In this work we develop a numerical method for solving a type of convex graph-structured tensor optimization problems. This type of problems, which can be seen as a generalization of multi-marginal optimal transport problems with graph-structured costs, appear in many applications. Examples are unbalanced optimal transport and multi-species potential mean field games, where the latter is a class of nonlinear density control problems. The method we develop is based on coordinate ascent in a Lagrangian dual, and under mild assumptions we prove that the algorithm converges globally. Moreover, under a set of stricter assumptions, the algorithm converges R-linearly. To perform the coordinate ascent steps one has to compute projections of the tensor, and doing so by brute force is in general not computationally feasible. Nevertheless, for certain graph structures it is possible to derive efficient methods for computing these projections, and here we specifically consider the graph structure that occurs in multi-species potential mean field games. We also illustrate the methodology on a numerical example from this problem class.
comment: 27 pages. Revision: among other things, the section on Convex dynamic network flow problems has been removed, and the section on Multi-species potential mean field games has been expanded
♻ ☆ Inferentialist Resource Semantics
In systems modelling, a system typically comprises located resources relative to which processes execute. One important use of logic in informatics is in modelling such systems for the purpose of reasoning (perhaps automated) about their behaviour and properties. To this end, one requires an interpretation of logical formulae in terms of the resources and states of the system; such an interpretation is called a resource semantics of the logic. This paper shows how inferentialism -- the view that meaning is given in terms of inferential behaviour -- enables a versatile and expressive framework for resource semantics. Specifically, how inferentialism seamlessly incorporates the assertion-based approach of the logic of Bunched Implications, foundational in program verification (e.g., as the basis of Separation Logic), and the renowned number-of-uses reading of Linear Logic. This integration enables reasoning about shared and separated resources in intuitive and familiar ways, as well as about the composition and interfacing of system components.
♻ ☆ Sparse Mean Field Load Balancing in Large Localized Queueing Systems
Scalable load balancing algorithms are of great interest in cloud networks and data centers, necessitating the use of tractable techniques to compute optimal load balancing policies for good performance. However, most existing scalable techniques, especially asymptotically scaling methods based on mean field theory, have not been able to model large queueing networks with strong locality. Meanwhile, general multi-agent reinforcement learning techniques can be hard to scale and usually lack a theoretical foundation. In this work, we address this challenge by leveraging recent advances in sparse mean field theory to learn a near-optimal load balancing policy in sparsely connected queueing networks in a tractable manner, which may be preferable to global approaches in terms of wireless communication overhead. Importantly, we obtain a general load balancing framework for a large class of sparse bounded-degree wireless topologies. By formulating a novel mean field control problem in the context of graphs with bounded degree, we reduce the otherwise difficult multi-agent problem to a single-agent problem. Theoretically, the approach is justified by approximation guarantees. Empirically, the proposed methodology performs well on several realistic and scalable wireless network topologies as compared to a number of well-known load balancing heuristics and existing scalable multi-agent reinforcement learning methods.
♻ ☆ A Control-Recoverable Added-Noise-based Privacy Scheme for LQ Control in Networked Control Systems
As networked control systems continue to evolve, ensuring the privacy of sensitive data becomes an increasingly pressing concern, especially in situations where the controller is physically separated from the plant. In this paper, we propose a secure control scheme for computing linear quadratic control in a networked control system utilizing two networked controllers, a privacy encoder and a control restorer. Specifically, the encoder generates two state signals blurred with random noise and sends them to the controllers, while the restorer reconstructs the correct control signal. The proposed design effectively preserves the privacy of the control system's state without sacrificing the control performance. We theoretically quantify the privacy-preserving performance in terms of the state estimation error of the controllers and the disclosure probability. Additionally, the proposed privacy-preserving scheme is also proven to satisfy differential privacy. Moreover, we extend the proposed privacy-preserving scheme and evaluation method to cases where collusion between two controllers occurs. Finally, we verify the validity of our proposed scheme through simulations.
♻ ☆ End-to-End Reinforcement Learning of Koopman Models for Economic Nonlinear Model Predictive Control
(Economic) nonlinear model predictive control ((e)NMPC) requires dynamic models that are sufficiently accurate and computationally tractable. Data-driven surrogate models for mechanistic models can reduce the computational burden of (e)NMPC; however, such models are typically trained by system identification for maximum prediction accuracy on simulation samples and perform suboptimally in (e)NMPC. We present a method for end-to-end reinforcement learning of Koopman surrogate models for optimal performance as part of (e)NMPC. We apply our method to two applications derived from an established nonlinear continuous stirred-tank reactor model. The controller performance is compared to that of (e)NMPCs utilizing models trained using system identification, and model-free neural network controllers trained using reinforcement learning. We show that the end-to-end trained models outperform those trained using system identification in (e)NMPC, and that, in contrast to the neural network controllers, the (e)NMPC controllers can react to changes in the control setting without retraining.
comment: manuscript (18 pages, 7 figures, 5 tables), supplementary materials (3 pages, 2 tables)
♻ ☆ Positive Competitive Networks for Sparse Reconstruction
We propose and analyze a continuous-time firing-rate neural network, the positive firing-rate competitive network (\pfcn), to tackle sparse reconstruction problems with non-negativity constraints. These problems, which involve approximating a given input stimulus from a dictionary using a set of sparse (active) neurons, play a key role in a wide range of domains, including for example neuroscience, signal processing, and machine learning. First, by leveraging the theory of proximal operators, we relate the equilibria of a family of continuous-time firing-rate neural networks to the optimal solutions of sparse reconstruction problems. Then, we prove that the \pfcn is a positive system and give rigorous conditions for the convergence to the equilibrium. Specifically, we show that the convergence: (i) only depends on a property of the dictionary; (ii) is linear-exponential, in the sense that initially the convergence rate is at worst linear and then, after a transient, it becomes exponential. We also prove a number of technical results to assess the contractivity properties of the neural dynamics of interest. Our analysis leverages contraction theory to characterize the behavior of a family of firing-rate competitive networks for sparse reconstruction with and without non-negativity constraints. Finally, we validate the effectiveness of our approach via a numerical example.
comment: 26 pages, 9 Figure, 1 Table
♻ ☆ Robust Direct Data-Driven Control for Probabilistic Systems
We propose a data-driven control method for systems with aleatoric uncertainty, for example, robot fleets with variations between agents. Our method leverages shared trajectory data to increase the robustness of the designed controller and thus facilitate transfer to new variations without the need for prior parameter and uncertainty estimations. In contrast to existing work on experience transfer for performance, our approach focuses on robustness and uses data collected from multiple realizations to guarantee generalization to unseen ones. Our method is based on scenario optimization combined with recent formulations for direct data-driven control. We derive lower bounds on the amount of data required to achieve quadratic stability for probabilistic systems with aleatoric uncertainty and demonstrate the benefits of our data-driven method through a numerical example. We find that the learned controllers generalize well to high variations in the dynamics even when based on only a few short open-loop trajectories. Robust experience transfer enables the design of safe and robust controllers that work out of the box without any additional learning during deployment.
♻ ☆ Performance Investigation of an Optimal Control Strategy for Zero-Emission Operations of Shipboard Microgrids SP
This work introduces an efficient power management approach for shipboard microgrids that integrates diesel generators, a fuel cell, and battery energy storage system. This strategy addresses both unit commitment and power dispatch, considering the zero-emission capability of the ship, as well as optimizing the ship's speed. The optimization is done through mixed integer linear programming with the objective of minimizing the operational cost of all the power resources. Evaluations are conducted on a notional all-electric ship, with electrical load simulated using a Markov chain based on actual measurement data. The findings underscore the effectiveness of the proposed strategy in optimizing fuel consumption while ensuring protection against blackout occurrences.
comment: Submitted to SPEEDAM 2024
♻ ☆ A Wind-Aware Path Planning Method for UAV-Asisted Bridge Inspection
In response to the gap in considering wind conditions in the bridge inspection using unmanned aerial vehicle (UAV) , this paper proposes a path planning method for UAVs that takes into account the influence of wind, based on the simulated annealing algorithm. The algorithm considers the wind factors, including the influence of different wind speeds and directions at the same time on the path planning of the UAV. Firstly, An environment model is constructed specifically for UAV bridge inspection, taking into account the various objective functions and constraint conditions of UAVs. A more sophisticated and precise mathematical model is then developed based on this environmental model to enable efficient and effective UAV path planning. Secondly, the bridge separation planning model is applied in a novel way, and a series of parameters are simulated, including the adjustment of the initial temperature value. The experimental results demonstrate that, compared with traditional local search algorithms, the proposed method achieves a cost reduction of 30.05\% and significantly improves effectiveness. Compared to path planning methods that do not consider wind factors, the proposed approach yields more realistic and practical results for UAV applications, as demonstrated by its improved effectiveness in simulations. These findings highlight the value of our method in facilitating more accurate and efficient UAV path planning in wind-prone environments.
comment: After carefully analysis, there is a bit design flaws in Algorithm 1. The experimental work of the paper is not comprehensive,which lacks an evaluation of the algorithm's running time
♻ ☆ DRAMS: Double-RIS Assisted Multihop Routing Scheme for Device-to-Device Communication
Reconfigurable intelligent surfaces (RISs) is a promising solution for enhancing the performance of multihop wireless communication networks. In this paper, we propose a double-RIS assisted multihop routing scheme for a device-to-device (D2D) communication network. Specifically, the scheme is dependent on the already deployed RISs and users in the surroundings. Besides the RISs, the emphasis of this work is to make more use of the existing intermediate users (IUs), which can act as relays. Hence, the density of RIS deployment in the surroundings can be reduced, which leads to the avoidance of resource wastage. However, we cannot solely depend on the IUs because this implies complete dependence on their availability for relaying and as a result, the aspect of reliability in terms of delay-constrained information transfer cannot be guaranteed. Moreover, the IUs are considered capable of energy harvesting and as a result, they do not waste their own energy in the process of volunteering to act as a relay for other users. Numerical results demonstrate the advantage of the proposed scheme over some existing approaches and lastly, useful insights related to the scheme design are also drawn, where we characterize the maximum acceptable delay at each hop under different set-ups.
comment: To appear in Elsevier Computer Communications
♻ ☆ An Optimal Control Framework for Influencing Human Driving Behavior in Mixed-Autonomy Traffic
As autonomous vehicles (AVs) become increasingly prevalent, their interaction with human drivers presents a critical challenge. Current AVs lack social awareness, causing behavior that is often awkward or unsafe. To combat this, social AVs, which are proactive rather than reactive in their behavior, have been explored in recent years. With knowledge of robot-human interaction dynamics, a social AV can influence a human driver to exhibit desired behaviors by strategically altering its own behaviors. In this paper, we present a novel framework for achieving human influence. The foundation of our framework lies in an innovative use of control barrier functions to formulate the desired objectives of influence as constraints in an optimal control problem. The computed controls gradually push the system state toward satisfaction of the objectives, e.g. slowing the human down to some desired speed. We demonstrate the proposed framework's feasibility in a variety of scenarios related to car-following and lane changes, including multi-robot and multi-human configurations. In two case studies, we validate the framework's effectiveness when applied to the problems of traffic flow optimization and aggressive behavior mitigation. Given these results, the main contribution of our framework is its versatility in a wide spectrum of influence objectives and mixed-autonomy configurations.
comment: Accepted to American Control Conference (ACC) 2024
♻ ☆ Learning Hierarchical Control For Multi-Agent Capacity-Constrained Systems
This paper introduces a novel data-driven hierarchical control scheme for managing a fleet of nonlinear, capacity-constrained autonomous agents in an iterative environment. We propose a control framework consisting of a high-level dynamic task assignment and routing layer and low-level motion planning and tracking layer. Each layer of the control hierarchy uses a data-driven Model Predictive Control (MPC) policy, maintaining bounded computational complexity at each calculation of a new task assignment or actuation input. We utilize collected data to iteratively refine estimates of agent capacity usage, and update MPC policy parameters accordingly. Our approach leverages tools from iterative learning control to integrate learning at both levels of the hierarchy, and coordinates learning between levels in order to maintain closed-loop feasibility and performance improvement of the connected architecture.
♻ ☆ A Survey of Predictive Maintenance: Systems, Purposes and Approaches
This paper highlights the importance of maintenance techniques in the coming industrial revolution, reviews the evolution of maintenance techniques, and presents a comprehensive literature review on the latest advancement of maintenance techniques, i.e., Predictive Maintenance (PdM), with emphasis on system architectures, optimization objectives, and optimization methods. In industry, any outages and unplanned downtime of machines or systems would degrade or interrupt a company's core business, potentially resulting in significant penalties and immeasurable reputation and economic loss. Existing traditional maintenance approaches, such as Reactive Maintenance (RM) and Preventive Maintenance (PM), suffer from high prevent and repair costs, inadequate or inaccurate mathematical degradation processes, and manual feature extraction. The incoming fourth industrial revolution is also demanding for a new maintenance paradigm to reduce the maintenance cost and downtime, and increase system availability and reliability. Predictive Maintenance (PdM) is envisioned the solution. In this survey, we first provide a high-level view of the PdM system architectures including PdM 4.0, Open System Architecture for Condition Based Monitoring (OSA-CBM), and cloud-enhanced PdM system. Then, we review the specific optimization objectives, which mainly comprise cost minimization, availability/reliability maximization, and multi-objective optimization. Furthermore, we present the optimization methods to achieve the aforementioned objectives, which include traditional Machine Learning (ML) based and Deep Learning (DL) based approaches. Finally, we highlight the future research directions that are critical to promote the application of DL techniques in the context of PdM.
comment: 38 pages, 23 figures
♻ ☆ Computationally Efficient Chance Constrained Covariance Control with Output Feedback
This paper studies the problem of developing computationally efficient solutions for steering the distribution of the state of a stochastic, linear dynamical system between two boundary Gaussian distributions in the presence of chance-constraints on the state and control input. It is assumed that the state is only partially available through a measurement model corrupted with noise. The filtered state is reconstructed with a Kalman filter, the chance constraints are reformulated as difference of convex (DC) constraints, and the resulting covariance control problem is reformulated as a DC program, which is solved using successive convexification. The efficiency of the proposed method is illustrated on a double integrator example with varying time horizons, and is compared to other state-of-the-art chance constrained covariance control methods.
comment: v2, submitted to CDC '24
♻ ☆ Coherent Equalization of Linear Quantum Systems
This paper introduces a $H_\infty$-like methodology of coherent filtering for equalization of passive linear quantum systems to help mitigate degrading effects of quantum communication channels. For such systems, which include a wide range of linear quantum optical devices and signals, we seek to find a near optimal equalizing filter which is itself a passive quantum system. The problem amounts to solving an optimization problem subject to constraints dictated by the requirement for the equalizer to be physically realizable. By formulating these constraints in the frequency domain, we show that the problem admits a convex $H_\infty$-like formulation. This allows us to derive a set of suboptimal coherent equalizers using $J$-spectral factorization. An additional semidefinite relaxation combined with the Nevanlinna-Pick interpolation is shown to lead to a tractable algorithm for the design of a near optimal coherent equalizer.
♻ ☆ Forced oscillation source localization from generator measurements
Malfunctioning equipment, erroneous operating conditions or periodic load variations can cause periodic disturbances that would persist over time, creating an undesirable transfer of energy across the system -- an effect referred to as forced oscillations. Wide-area oscillations may damage assets, trigger inadvertent tripping or control actions, and be the cause of equipment failure. Unfortunately, for wide-area oscillations, the location, frequency, and amplitude of these forced oscillations may be hard to determine. Recently, a data-driven maximum-likelihood-based method was proposed to perform source localization in transmission grids under wide-area response scenarios. However, this method relies on full PMU coverage and all buses having inertia and damping. Here, we extend this method to realistic scenarios which includes buses without inertia or dumping, such as passive loads and inverter-based generators. Incorporating Kron reduction directly into the maximum likelihood estimator, we are able to identify the location and frequency of forcing applied at both traditional generators and loads.
comment: 7 pages, 3 figures
Databases
☆ Fourier Transform-based Estimators for Data Sketches
In this paper we consider the problem of estimating the $f$-moment ($\sum_{v\in [n]} (f(\mathbf{x}(v))-f(0))$) of a dynamic vector $\mathbf{x}\in \mathbb{G}^n$ over some abelian group $(\mathbb{G},+)$, where the $\|f\|_\infty$ norm is bounded. We propose a simple sketch and new estimation framework based on the \emph{Fourier transform} of $f$. I.e., we decompose $f$ into a linear combination of homomorphisms $f_1,f_2,\ldots$ from $(\mathbb{G},+)$ to $(\mathbb{C},\times)$, estimate the $f_k$-moment for each $f_k$, and synthesize them to obtain an estimate of the $f$-moment. Our estimators are asymptotically unbiased and have variance asymptotic to $\|\mathbf{x}\|_0^2 (\|f\|_\infty^2 m^{-1} + \|\hat{f}\|_1^2 m^{-2})$, where the size of the sketch is $O(m\log n\log|\mathbb{G}|)$ bits. When $\mathbb{G}=\mathbb{Z}$ this problem can also be solved using off-the-shelf $\ell_0$-samplers with space $O(m\log^2 n)$ bits, which does not obviously generalize to finite groups. As a concrete benchmark, we extend Ganguly, Garofalakis, and Rastogi's singleton-detector-based sampler to work over $\mathbb{G}$ using $O(m\log n\log|\mathbb{G}|\log(m\log n))$ bits. We give some experimental evidence that the Fourier-based estimation framework is significantly more accurate than sampling-based approaches at the same memory footprint.
☆ ProvDeploy: Provenance-oriented Containerization of High Performance Computing Scientific Workflows
Many existing scientific workflows require High Performance Computing environments to produce results in a timely manner. These workflows have several software library components and use different environments, making the deployment and execution of the software stack not trivial. This complexity increases if the user needs to add provenance data capture services to the workflow. This manuscript introduces ProvDeploy to assist the user in configuring containers for scientific workflows with integrated provenance data capture. ProvDeploy was evaluated with a Scientific Machine Learning workflow, exploring containerization strategies focused on provenance in two distinct HPC environments
☆ Hydro: Adaptive Query Processing of ML Queries
Query optimization in relational database management systems (DBMSs) is critical for fast query processing. The query optimizer relies on precise selectivity and cost estimates to effectively optimize queries prior to execution. While this strategy is effective for relational DBMSs, it is not sufficient for DBMSs tailored for processing machine learning (ML) queries. In ML-centric DBMSs, query optimization is challenging for two reasons. First, the performance bottleneck of the queries shifts to user-defined functions (UDFs) that often wrap around deep learning models, making it difficult to accurately estimate UDF statistics without profiling the query. This leads to inaccurate statistics and sub-optimal query plans. Second, the optimal query plan for ML queries is data-dependent, necessitating DBMSs to adapt the query plan on the fly during execution. So, a static query plan is not sufficient for such queries. In this paper, we present Hydro, an ML-centric DBMS that utilizes adaptive query processing (AQP) for efficiently processing ML queries. Hydro is designed to quickly evaluate UDF-based query predicates by ensuring optimal predicate evaluation order and improving the scalability of UDF execution. By integrating AQP, Hydro continuously monitors UDF statistics, routes data to predicates in an optimal order, and dynamically allocates resources for evaluating predicates. We demonstrate Hydro's efficacy through four illustrative use cases, delivering up to 11.52x speedup over a baseline system.
☆ Efficiently Estimating Mutual Information Between Attributes Across Tables ICDE 2024
Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, it is challenging to efficiently discover relevant external tables to join with a given input table. Existing approaches rely on data discovery systems to identify joinable tables from external sources, typically based on overlap or containment. However, the sheer number of tables obtained from these systems results in irrelevant joins that need to be performed; this can be computationally expensive or even infeasible in practice. We address this limitation by proposing the use of efficient mutual information (MI) estimation for finding relevant joinable tables. We introduce a new sketching method that enables efficient evaluation of relationship discovery queries by estimating MI without materializing the joins and returning a smaller set of tables that are more likely to be relevant. We also demonstrate the effectiveness of our approach at approximating MI in extensive experiments using synthetic and real-world datasets.
comment: Accepted to IEEE ICDE 2024
♻ ☆ Combining the Strengths of Dutch Survey and Register Data in a Data Challenge to Predict Fertility (PreFer)
The social sciences have produced an impressive body of research on determinants of fertility outcomes, or whether and when people have children. However, the strength of these determinants and underlying theories are rarely evaluated on their predictive ability on new data. This prevents us from systematically comparing studies, hindering the evaluation and accumulation of knowledge. In this paper, we present two datasets which can be used to study the predictability of fertility outcomes in the Netherlands. One dataset is based on the LISS panel, a longitudinal survey which includes thousands of variables on a wide range of topics, including individual preferences and values. The other is based on the Dutch register data which lacks attitudinal data but includes detailed information about the life courses of millions of Dutch residents. We provide information about the datasets and the samples, and describe the fertility outcome of interest. We also introduce the fertility prediction data challenge PreFer which is based on these datasets and will start in Spring 2024. We outline the ways in which measuring the predictability of fertility outcomes using these datasets and combining their strengths in the data challenge can advance our understanding of fertility behaviour and computational social science. We further provide details for participants on how to take part in the data challenge.
♻ ☆ Gen-T: Table Reclamation in Data Lakes ICDE 2024
We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete or inconsistent. To do this, we define a new measure of table similarity, called error-aware instance similarity, to measure how close a reclaimed table is to a Source Table, a measure grounded in instance similarity used in data exchange. Our search covers not only SELECT-PROJECT- JOIN queries, but integration queries with unions, outerjoins, and the unary operators subsumption and complementation that have been shown to be important in data integration and fusion. Using reclamation, a data scientist can understand if any tables in a repository can be used to exactly reclaim a tuple in the Source. If not, one can understand if this is due to differences in values or to incompleteness in the data. Our solution, Gen-T, performs table discovery to retrieve a set of candidate tables from the table repository, filters these down to a set of originating tables, then integrates these tables to reclaim the Source as closely as possible. We show that our solution, while approximate, is accurate, efficient and scalable in the size of the table repository with experiments on real data lakes containing up to 15K tables, where the average number of tuples varies from small (web tables) to extremely large (open data tables) up to 1M tuples.
comment: to appear at ICDE 2024
Emerging Technologies
☆ Cross-layer Modeling and Design of Content Addressable Memories in Advanced Technology Nodes for Similarity Search
In this paper we present a comprehensive design and benchmarking study of Content Addressable Memory (CAM) at the 7nm technology node in the context of similarity search applications. We design CAM cells based on SRAM, spin-orbit torque, and ferroelectric field effect transistor devices and from their layouts extract cell parasitics using state of the art EDA tools. These parasitics are used to develop SPICE netlists to model search operations. We use a CAM-based dataset search and a sequential recommendation system to highlight the application-level performance degradation due to interconnect parasitics. We propose and evaluate two solutions to mitigate interconnect effects.
comment: 7 pages, 5 figures
☆ Solving a Real-World Package Delivery Routing Problem Using Quantum Annealers
Research focused on the conjunction between quantum computing and routing problems has been very prolific in recent years. Most of the works revolve around classical problems such as the Traveling Salesman Problem or the Vehicle Routing Problem. Even though working on these problems is valuable, it is also undeniable that their academic-oriented nature falls short of real-world requirements. The main objective of this research is to present a solving method for realistic instances, avoiding problem relaxations or technical shortcuts. Instead, a quantum-classical hybrid solver has been developed, coined Q4RPD, that considers a set of real constraints such as a heterogeneous fleet of vehicles, priority deliveries, and capacities characterized by two values: weight and dimensions of the packages. Q4RPD resorts to the Leap Constrained Quadratic Model Hybrid Solver of D-Wave. To demonstrate the application of Q4RPD, an experimentation composed of six different instances has been conducted, aiming to serve as illustrative examples.
comment: 15 pages, 11 figures and 4 tables. Paper submitted for review in Scientific Reports
♻ ☆ LLMR: Real-time Prompting of Interactive Worlds using Large Language Models CHI 2024
We present Large Language Model for Mixed Reality (LLMR), a framework for the real-time creation and modification of interactive Mixed Reality experiences using LLMs. LLMR leverages novel strategies to tackle difficult cases where ideal training data is scarce, or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity. Our framework relies on text interaction and the Unity game engine. By incorporating techniques for scene understanding, task planning, self-debugging, and memory management, LLMR outperforms the standard GPT-4 by 4x in average error rate. We demonstrate LLMR's cross-platform interoperability with several example worlds, and evaluate it on a variety of creation and modification tasks to show that it can produce and edit diverse objects, tools, and scenes. Finally, we conducted a usability study (N=11) with a diverse set that revealed participants had positive experiences with the system and would use it again.
comment: 46 pages, 18 figures; Matching version accepted at CHI 2024
♻ ☆ No distributed quantum advantage for approximate graph coloring STOC 2024
We give an almost complete characterization of the hardness of $c$-coloring $\chi$-chromatic graphs with distributed algorithms, for a wide range of models of distributed computing. In particular, we show that these problems do not admit any distributed quantum advantage. To do that: 1) We give a new distributed algorithm that finds a $c$-coloring in $\chi$-chromatic graphs in $\tilde{\mathcal{O}}(n^{\frac{1}{\alpha}})$ rounds, with $\alpha = \bigl\lfloor\frac{c-1}{\chi - 1}\bigr\rfloor$. 2) We prove that any distributed algorithm for this problem requires $\Omega(n^{\frac{1}{\alpha}})$ rounds. Our upper bound holds in the classical, deterministic LOCAL model, while the near-matching lower bound holds in the non-signaling model. This model, introduced by Arfaoui and Fraigniaud in 2014, captures all models of distributed graph algorithms that obey physical causality; this includes not only classical deterministic LOCAL and randomized LOCAL but also quantum-LOCAL, even with a pre-shared quantum state. We also show that similar arguments can be used to prove that, e.g., 3-coloring 2-dimensional grids or $c$-coloring trees remain hard problems even for the non-signaling model, and in particular do not admit any quantum advantage. Our lower-bound arguments are purely graph-theoretic at heart; no background on quantum information theory is needed to establish the proofs.
comment: Accepted to STOC 2024
♻ ☆ Hardware in Loop Learning with Spin Stochastic Neurons
Despite the promise of superior efficiency and scalability, real-world deployment of emerging nanoelectronic platforms for brain-inspired computing have been limited thus far, primarily because of inter-device variations and intrinsic non-idealities. In this work, we demonstrate mitigating these issues by performing learning directly on practical devices through a hardware-in-loop approach, utilizing stochastic neurons based on heavy metal/ferromagnetic spin-orbit torque heterostructures. We characterize the probabilistic switching and device-to-device variability of our fabricated devices of various sizes to showcase the effect of device dimension on the neuronal dynamics and its consequent impact on network-level performance. The efficacy of the hardware-in-loop scheme is illustrated in a deep learning scenario achieving equivalent software performance. This work paves the way for future large-scale implementations of neuromorphic hardware and realization of truly autonomous edge-intelligent devices.
♻ ☆ Depth-Optimal Addressing of 2D Qubit Array with 1D Controls Based on Exact Binary Matrix Factorization
Reducing control complexity is essential for achieving large-scale quantum computing. However, reducing control knobs may compromise the ability to independently address each qubit. Recent progress in neutral atom-based platforms suggests that rectangular (row-column) addressing may strike a balance between control granularity and flexibility for 2D qubit arrays. This scheme allows addressing qubits on the intersections of a set of rows and columns each time. While quadratically reducing controls, it may necessitate more depth. We formulate the depth-optimal rectangular addressing problem as exact binary matrix factorization, an NP-hard problem also appearing in communication complexity and combinatorial optimization. We introduce a satisfiability modulo theories-based solver for this problem, and a heuristic, row packing, performing close to the optimal solver on various benchmarks. Furthermore, we discuss rectangular addressing in the context of fault-tolerant quantum computing, leveraging a natural two-level structure.
Networking and Internet Architecture
☆ FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions
Reliable numerical computations are central to scientific computing, but the floating-point arithmetic that enables large-scale models is error-prone. Numeric exceptions are a common occurrence and can propagate through code, leading to flawed results. This paper presents FlowFPX, a toolkit for systematically debugging floating-point exceptions by recording their flow, coalescing exception contexts, and fuzzing in select locations. These tools help scientists discover when exceptions happen and track down their origin, smoothing the way to a reliable codebase.
comment: Presented at JuliaCon 2023; to appear in JuliaCon proceedings
Mathematical Software
☆ FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions
Reliable numerical computations are central to scientific computing, but the floating-point arithmetic that enables large-scale models is error-prone. Numeric exceptions are a common occurrence and can propagate through code, leading to flawed results. This paper presents FlowFPX, a toolkit for systematically debugging floating-point exceptions by recording their flow, coalescing exception contexts, and fuzzing in select locations. These tools help scientists discover when exceptions happen and track down their origin, smoothing the way to a reliable codebase.
comment: Presented at JuliaCon 2023; to appear in JuliaCon proceedings
Multiagent Systems
☆ FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions
Reliable numerical computations are central to scientific computing, but the floating-point arithmetic that enables large-scale models is error-prone. Numeric exceptions are a common occurrence and can propagate through code, leading to flawed results. This paper presents FlowFPX, a toolkit for systematically debugging floating-point exceptions by recording their flow, coalescing exception contexts, and fuzzing in select locations. These tools help scientists discover when exceptions happen and track down their origin, smoothing the way to a reliable codebase.
comment: Presented at JuliaCon 2023; to appear in JuliaCon proceedings
Human-Computer Interaction
☆ Extended Reality for Enhanced Human-Robot Collaboration: a Human-in-the-Loop Approach
The rise of automation has provided an opportunity to achieve higher efficiency in manufacturing processes, yet it often compromises the flexibility required to promptly respond to evolving market needs and meet the demand for customization. Human-robot collaboration attempts to tackle these challenges by combining the strength and precision of machines with human ingenuity and perceptual understanding. In this paper, we conceptualize and propose an implementation framework for an autonomous, machine learning-based manipulator that incorporates human-in-the-loop principles and leverages Extended Reality (XR) to facilitate intuitive communication and programming between humans and robots. Furthermore, the conceptual framework foresees human involvement directly in the robot learning process, resulting in higher adaptability and task generalization. The paper highlights key technologies enabling the proposed framework, emphasizing the importance of developing the digital ecosystem as a whole. Additionally, we review the existent implementation approaches of XR in human-robot collaboration, showcasing diverse perspectives and methodologies. The challenges and future outlooks are discussed, delving into the major obstacles and potential research avenues of XR for more natural human-robot interaction and integration in the industrial landscape.
☆ Envisioning the Next-Generation AI Coding Assistants: Insights & Proposals
As a research-product hybrid group in AI for Software Engineering (AI4SE), we present four key takeaways from our experience developing in-IDE AI coding assistants. AI coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. We propose open questions and challenges that academia and industry should address to realize the vision of next-generation AI coding assistants.
☆ The Era of Semantic Decoding
Recent work demonstrated great promise in the idea of orchestrating collaborations between LLMs, human input, and various tools to address the inherent limitations of LLMs. We propose a novel perspective called semantic decoding, which frames these collaborative processes as optimization procedures in semantic space. Specifically, we conceptualize LLMs as semantic processors that manipulate meaningful pieces of information that we call semantic tokens (known thoughts). LLMs are among a large pool of other semantic processors, including humans and tools, such as search engines or code executors. Collectively, semantic processors engage in dynamic exchanges of semantic tokens to progressively construct high-utility outputs. We refer to these orchestrated interactions among semantic processors, optimizing and searching in semantic space, as semantic decoding algorithms. This concept draws a direct parallel to the well-studied problem of syntactic decoding, which involves crafting algorithms to best exploit auto-regressive language models for extracting high-utility sequences of syntactic tokens. By focusing on the semantic level and disregarding syntactic details, we gain a fresh perspective on the engineering of AI systems, enabling us to imagine systems with much greater complexity and capabilities. In this position paper, we formalize the transition from syntactic to semantic tokens as well as the analogy between syntactic and semantic decoding. Subsequently, we explore the possibilities of optimizing within the space of semantic tokens via semantic decoding algorithms. We conclude with a list of research opportunities and questions arising from this fresh perspective. The semantic decoding perspective offers a powerful abstraction for search and optimization directly in the space of meaningful concepts, with semantic tokens as the fundamental units of a new type of computation.
comment: 25 pages, 3 figures
☆ Looking Together $\neq$ Seeing the Same Thing: Understanding Surgeons' Visual Needs During Intra-operative Coordination and Instruction
Shared gaze visualizations have been found to enhance collaboration and communication outcomes in diverse HCI scenarios including computer supported collaborative work and learning contexts. Given the importance of gaze in surgery operations, especially when a surgeon trainer and trainee need to coordinate their actions, research on the use of gaze to facilitate intra-operative coordination and instruction has been limited and shows mixed implications. We performed a field observation of 8 surgeries and an interview study with 14 surgeons to understand their visual needs during operations, informing ways to leverage and augment gaze to enhance intra-operative coordination and instruction. We found that trainees have varying needs in receiving visual guidance which are often unfulfilled by the trainers' instructions. It is critical for surgeons to control the timing of the gaze-based visualizations and effectively interpret gaze data. We suggest overlay technologies, e.g., gaze-based summaries and depth sensing, to augment raw gaze in support of surgical coordination and instruction.
☆ Changing human's impression of empathy from agent by verbalizing agent's position
As anthropomorphic agents (AI and robots) are increasingly used in society, empathy and trust between people and agents are becoming increasingly important. A better understanding of agents by people will help to improve the problems caused by the future use of agents in society. In the past, there has been a focus on the importance of self-disclosure and the relationship between agents and humans in their interactions. In this study, we focused on the attributes of self-disclosure and the relationship between agents and people. An experiment was conducted to investigate hypotheses on trust and empathy with agents through six attributes of self-disclosure (opinions and attitudes, hobbies, work, money, personality, and body) and through competitive and cooperative relationships before a robotic agent performs a joint task. The experiment consisted of two between-participant factors: six levels of self-disclosure attributes and two levels of relationship with the agent. The results showed that the two factors had no effect on trust in the agent, but there was statistical significance for the attribute of self-disclosure regarding a person's empathy toward the agent. In addition, statistical significance was found regarding the agent's ability to empathize with a person as perceived by the person only in the case where the type of relationship, competitive or cooperative, was presented. The results of this study could lead to an effective method for building relationships with agents, which are increasingly used in society.
comment: 8 pages, 5 figures, 4 tables, submitted RO-MAN2024. arXiv admin note: text overlap with arXiv:2306.09447
☆ Dynamic Explanation Emphasis in Human-XAI Interaction with Communication Robot
Communication robots have the potential to contribute to effective human-XAI interaction as an interface that goes beyond textual or graphical explanations. One of their strengths is that they can use physical and vocal expressions to add detailed nuances to explanations. However, it is not clear how a robot can apply such expressions, or in particular, how we can develop a strategy to adaptively use such expressions depending on the task and user in dynamic interactions. To address this question, this paper proposes DynEmph, a method for a communication robot to decide where to emphasize XAI-generated explanations with physical expressions. It predicts the effect of emphasizing certain points on a user and aims to minimize the expected difference between predicted user decisions and AI-suggested ones. DynEmph features a strategy for deciding where to emphasize in a data-driven manner, relieving engineers from the need to manually design a strategy. We further conducted experiments to investigate how emphasis selection strategies affect the performance of user decisions. The results suggest that, while a naive strategy (emphasizing explanations for an AI's most probable class) does not necessarily work better, DynEmph effectively guides users to better decisions under the condition that the performance of the AI suggestion is high.
☆ How Human-Centered Explainable AI Interface Are Designed and Evaluated: A Systematic Survey
Despite its technological breakthroughs, eXplainable Artificial Intelligence (XAI) research has limited success in producing the {\em effective explanations} needed by users. In order to improve XAI systems' usability, practical interpretability, and efficacy for real users, the emerging area of {\em Explainable Interfaces} (EIs) focuses on the user interface and user experience design aspects of XAI. This paper presents a systematic survey of 53 publications to identify current trends in human-XAI interaction and promising directions for EI design and development. This is among the first systematic survey of EI research.
☆ Detoxifying Large Language Models via Knowledge Editing
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and equips comprehensive metrics for systematic evaluation. We conduct experiments to compare knowledge editing approaches with previous baselines, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance. Then, we propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), to diminish the toxicity of LLMs within a few tuning steps via only one instance. We further provide an in-depth analysis of the internal mechanism for various detoxify approaches, demonstrating that previous methods like SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of the toxic parameters to a certain extent, making permanent adjustments. We hope that these insights could shed light on future work of developing detoxifying approaches and the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.
comment: Ongoing work. Project website: https://zjunlp.github.io/project/SafeEdit Benchmark: https://huggingface.co/datasets/zjunlp/SafeEdit Code: https://github.com/zjunlp/EasyEdit
☆ Recourse for reclamation: Chatting with generative language models CHI
Researchers and developers increasingly rely on toxicity scoring to moderate generative language model outputs, in settings such as customer service, information retrieval, and content generation. However, toxicity scoring may render pertinent information inaccessible, rigidify or "value-lock" cultural norms, and prevent language reclamation processes, particularly for marginalized people. In this work, we extend the concept of algorithmic recourse to generative language models: we provide users a novel mechanism to achieve their desired prediction by dynamically setting thresholds for toxicity filtering. Users thereby exercise increased agency relative to interactions with the baseline system. A pilot study ($n = 30$) supports the potential of our proposed recourse mechanism, indicating improvements in usability compared to fixed-threshold toxicity-filtering of model outputs. Future work should explore the intersection of toxicity scoring, model controllability, user agency, and language reclamation processes -- particularly with regard to the bias that many communities encounter when interacting with generative language models.
comment: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA 2024)
☆ Tell Me What You Want (What You Really, Really Want): Addressing the Expectation Gap for Goal Conveyance from Humans to Robots
Conveying human goals to autonomous systems (AS) occurs both when the system is being designed and when it is being operated. The design-step conveyance is typically mediated by robotics and AI engineers, who must appropriately capture end-user requirements and concepts of operations, while the operation-step conveyance is mediated by the design, interfaces, and behavior of the AI. However, communication can be difficult during both these periods because of mismatches in the expectations and expertise of the end-user and the roboticist, necessitating more design cycles to resolve. We examine some of the barriers in communicating system design requirements, and develop an augmentation for applied cognitive task analysis (ACTA) methods, that we call robot task analysis (RTA), pertaining specifically to the development of autonomous systems. Further, we introduce a top-down view of an underexplored area of friction between requirements communication -- implied human expectations -- utilizing a collection of work primarily from experimental psychology and social sciences. We show how such expectations can be used in conjunction with task-specific expectations and the system design process for AS to improve design team communication, alleviate barriers to user rejection, and reduce the number of design cycles.
comment: Presented at the End-User Development for Human-Robot Interaction (EUD4HRI) workshop at HRI 2024
☆ From Perils to Possibilities: Understanding how Human (and AI) Biases affect Online Fora
Social media platforms are online fora where users engage in discussions, share content, and build connections. This review explores the dynamics of social interactions, user-generated contents, and biases within the context of social media analysis (analyzing works that use the tools offered by complex network analysis and natural language processing) through the lens of three key points of view: online debates, online support, and human-AI interactions. On the one hand, we delineate the phenomenon of online debates, where polarization, misinformation, and echo chamber formation often proliferate, driven by algorithmic biases and extreme mechanisms of homophily. On the other hand, we explore the emergence of online support groups through users' self-disclosure and social support mechanisms. Online debates and support mechanisms present a duality of both perils and possibilities within social media; perils of segregated communities and polarized debates, and possibilities of empathy narratives and self-help groups. This dichotomy also extends to a third perspective: users' reliance on AI-generated content, such as the ones produced by Large Language Models, which can manifest both human biases hidden in training sets and non-human biases that emerge from their artificial neural architectures. Analyzing interdisciplinary approaches, we aim to deepen the understanding of the complex interplay between social interactions, user-generated content, and biases within the realm of social media ecosystems.
☆ PeerGPT: Probing the Roles of LLM-based Peer Agents as Team Moderators and Participants in Children's Collaborative Learning CHI
In children's collaborative learning, effective peer conversations can significantly enhance the quality of children's collaborative interactions. The integration of Large Language Model (LLM) agents into this setting explores their novel role as peers, assessing impacts as team moderators and participants. We invited two groups of participants to engage in a collaborative learning workshop, where they discussed and proposed conceptual solutions to a design problem. The peer conversation transcripts were analyzed using thematic analysis. We discovered that peer agents, while managing discussions effectively as team moderators, sometimes have their instructions disregarded. As participants, they foster children's creative thinking but may not consistently provide timely feedback. These findings highlight potential design improvements and considerations for peer agents in both roles.
comment: To appear at CHI EA '24
☆ A Design Space for Intelligent and Interactive Writing Assistants CHI 2024
In our era of rapid technological advancement, the research landscape for writing assistants has become increasingly fragmented across various research communities. We seek to address this challenge by proposing a design space as a structured way to examine and explore the multidimensional space of intelligent and interactive writing assistants. Through a large community collaboration, we explore five aspects of writing assistants: task, user, technology, interaction, and ecosystem. Within each aspect, we define dimensions (i.e., fundamental components of an aspect) and codes (i.e., potential options for each dimension) by systematically reviewing 115 papers. Our design space aims to offer researchers and designers a practical tool to navigate, comprehend, and compare the various possibilities of writing assistants, and aid in the envisioning and design of new writing assistants.
comment: Published as a conference paper at CHI 2024
☆ Empowering Personalized Learning through a Conversation-based Tutoring System with Student Modeling CHI 2024
As the recent Large Language Models(LLM's) become increasingly competent in zero-shot and few-shot reasoning across various domains, educators are showing a growing interest in leveraging these LLM's in conversation-based tutoring systems. However, building a conversation-based personalized tutoring system poses considerable challenges in accurately assessing the student and strategically incorporating the assessment into teaching within the conversation. In this paper, we discuss design considerations for a personalized tutoring system that involves the following two key components: (1) a student modeling with diagnostic components, and (2) a conversation-based tutor utilizing LLM with prompt engineering that incorporates student assessment outcomes and various instructional strategies. Based on these design considerations, we created a proof-of-concept tutoring system focused on personalization and tested it with 20 participants. The results substantiate that our system's framework facilitates personalization, with particular emphasis on the elements constituting student modeling. A web demo of our system is available at http://rlearning-its.com.
comment: Accepted to ACM CHI 2024 LBW
☆ A Roadmap Towards Automated and Regulated Robotic Systems
The rapid development of generative technology opens up possibility for higher level of automation, and artificial intelligence (AI) embodiment in robotic systems is imminent. However, due to the blackbox nature of the generative technology, the generation of the knowledge and workflow scheme is uncontrolled, especially in a dynamic environment and a complex scene. This poses challenges to regulations in safety-demanding applications such as medical scenes. We argue that the unregulated generative processes from AI is fitted for low level end tasks, but intervention in the form of manual or automated regulation should happen post-workflow-generation and pre-robotic-execution. To address this, we propose a roadmap that can lead to fully automated and regulated robotic systems. In this paradigm, the high level policies are generated as structured graph data, enabling regulatory oversight and reusability, while the code base for lower level tasks is generated by generative models. Our approach aims the transitioning from expert knowledge to regulated action, akin to the iterative processes of study, practice, scrutiny, and execution in human tasks. We identify the generative and deterministic processes in a design cycle, where generative processes serve as a text-based world simulator and the deterministic processes generate the executable system. We propose State Machine Seralization Language (SMSL) to be the conversion point between text simulator and executable workflow control. From there, we analyze the modules involved based on the current literature, and discuss human in the loop. As a roadmap, this work identifies the current possible implementation and future work. This work does not provide an implemented system but envisions to inspire the researchers working on the direction in the roadmap. We implement the SMSL and D-SFO paradigm that serve as the starting point of the roadmap.
comment: 17 pages, 9 figures
☆ The opportunities and risks of large language models in mental health
Global rates of mental health concerns are rising and there is increasing realization that existing models of mental healthcare will not adequately expand to meet the demand. With the emergence of large language models (LLMs) has come great optimism regarding their promise to create novel, large-scale solutions to support mental health. Despite their nascence, LLMs have already been applied to mental health-related tasks. In this review, we summarize the extant literature on efforts to use LLMs to provide mental health education, assessment, and intervention and highlight key opportunities for positive impact in each area. We then highlight risks associated with LLMs application to mental health and encourage adoption of strategies to mitigate these risks. The urgent need for mental health support must be balanced with responsible development, testing, and deployment of mental health LLMs. Especially critical is ensuring that mental health LLMs are fine-tuned for mental health, enhance mental health equity, adhere to ethical standards, and that people, including those with lived experience with mental health concerns, are involved in all stages from development through deployment. Prioritizing these efforts will minimize potential harms to mental health and maximize the likelihood that LLMs will positively impact mental health globally.
comment: 12 pages, 2 tables, 4 figures
☆ Curvature Augmented Manifold Embedding and Learning
A new dimensional reduction (DR) and data visualization method, Curvature-Augmented Manifold Embedding and Learning (CAMEL), is proposed. The key novel contribution is to formulate the DR problem as a mechanistic/physics model, where the force field among nodes (data points) is used to find an n-dimensional manifold representation of the data sets. Compared with many existing attractive-repulsive force-based methods, one unique contribution of the proposed method is to include a non-pairwise force. A new force field model is introduced and discussed, inspired by the multi-body potential in lattice-particle physics and Riemann curvature in topology. A curvature-augmented force is included in CAMEL. Following this, CAMEL formulation for unsupervised learning, supervised learning, semi-supervised learning/metric learning, and inverse learning are provided. Next, CAMEL is applied to many benchmark datasets by comparing existing models, such as tSNE, UMAP, TRIMAP, and PacMap. Both visual comparison and metrics-based evaluation are performed. 14 open literature and self-proposed metrics are employed for a comprehensive comparison. Conclusions and future work are suggested based on the current investigation. Related code and demonstration are available on https://github.com/ymlasu/CAMEL for interested readers to reproduce the results and other applications.
☆ Identifying Challenges in Designing, Developing and Evaluating Data Visualizations for Large Displays IEEE VIS 2023
With the growth of data sizes, visualizing them becomes more complex. Desktop displays are insufficient for presenting and collaborating on complex data visualizations. Large displays could provide the necessary space to demo or present complex data visualizations. However, designing and developing visualizations for such displays pose distinct challenges. Identifying these challenges is essential for researchers, designers, and developers in the field of data visualization. In this study, we aim to gain insights into the challenges encountered by designers and developers when creating data visualizations for large displays. We conducted a series of semi-structured interviews with experts who had experience in large displays and, through affinity diagramming, categorized the challenges.
comment: 2 pages, IEEE VIS 2023, Poster session
☆ Sequential Decision-Making for Inline Text Autocomplete
Autocomplete suggestions are fundamental to modern text entry systems, with applications in domains such as messaging and email composition. Typically, autocomplete suggestions are generated from a language model with a confidence threshold. However, this threshold does not directly take into account the cognitive load imposed on the user by surfacing suggestions, such as the effort to switch contexts from typing to reading the suggestion, and the time to decide whether to accept the suggestion. In this paper, we study the problem of improving inline autocomplete suggestions in text entry systems via a sequential decision-making formulation, and use reinforcement learning to learn suggestion policies through repeated interactions with a target user over time. This formulation allows us to factor cognitive load into the objective of training an autocomplete model, through a reward function based on text entry speed. We acquired theoretical and experimental evidence that, under certain objectives, the sequential decision-making formulation of the autocomplete problem provides a better suggestion policy than myopic single-step reasoning. However, aligning these objectives with real users requires further exploration. In particular, we hypothesize that the objectives under which sequential decision-making can improve autocomplete systems are not tailored solely to text entry speed, but more broadly to metrics such as user satisfaction and convenience.
☆ Enhancing retrofit device adoption in social housing: evidence from two field experiments in Belgium
Energy efficient technologies are particularly important for social housing settings: they offer the potential to improve tenants' wellbeing through monetary savings and comfort, while reducing emissions of entire communities. Slow uptake of innovative energy technology in social housing has been associated with a lack of trust and the perceived risks of adoption. To counteract both, we designed a communication campaign for a retrofit technology for heating including social norms for technology adoption and concretely experienced benefits. We report two randomized controlled trials (RCT) in two different social housing communities in Belgium. In the first study, randomization was on housing block level: the communication led to significant higher uptake rates compared to the control group. In the second study randomization occurred on apartment level, again yielding a significant increase, when an interaction with housing blocks was considered. We discuss challenges of conducting randomized controlled trials in social housing communities.
☆ EEG decoding with conditional identification information SP
Decoding EEG signals is crucial for unraveling human brain and advancing brain-computer interfaces. Traditional machine learning algorithms have been hindered by the high noise levels and inherent inter-person variations in EEG signals. Recent advances in deep neural networks (DNNs) have shown promise, owing to their advanced nonlinear modeling capabilities. However, DNN still faces challenge in decoding EEG samples of unseen individuals. To address this, this paper introduces a novel approach by incorporating the conditional identification information of each individual into the neural network, thereby enhancing model representation through the synergistic interaction of EEG and personal traits. We test our model on the WithMe dataset and demonstrated that the inclusion of these identifiers substantially boosts accuracy for both subjects in the training set and unseen subjects. This enhancement suggests promising potential for improving for EEG interpretability and understanding of relevant identification features.
comment: Accepted by 6th International Conference on Advances in Signal Processing and Artificial Intelligence (ASPAI' 2024)
☆ Multi-Level Feedback Generation with Large Language Models for Empowering Novice Peer Counselors
Realistic practice and tailored feedback are key processes for training peer counselors with clinical skills. However, existing mechanisms of providing feedback largely rely on human supervision. Peer counselors often lack mechanisms to receive detailed feedback from experienced mentors, making it difficult for them to support the large number of people with mental health issues who use peer counseling. Our work aims to leverage large language models to provide contextualized and multi-level feedback to empower peer counselors, especially novices, at scale. To achieve this, we co-design with a group of senior psychotherapy supervisors to develop a multi-level feedback taxonomy, and then construct a publicly available dataset with comprehensive feedback annotations of 400 emotional support conversations. We further design a self-improvement method on top of large language models to enhance the automatic generation of feedback. Via qualitative and quantitative evaluation with domain experts, we demonstrate that our method minimizes the risk of potentially harmful and low-quality feedback generation which is desirable in such high-stakes scenarios.
♻ ☆ Intelligent Canvas: Enabling Design-Like Exploratory Visual Data Analysis with Generative AI through Rapid Prototyping, Iteration and Curation
Complex data analysis inherently seeks unexpected insights through exploratory visual analysis methods, transcending logical, step-by-step processing. However, existing interfaces such as notebooks and dashboards have limitations in exploration and comparison for visual data analysis. Addressing these limitations, we introduce a "design-like" intelligent canvas environment integrating generative AI into data analysis, offering rapid prototyping, iteration, and comparative visualization management. Our dual contributions include the integration of generative AI components into a canvas interface, and empirical findings from a user study (N=10) evaluating the effectiveness of the canvas interface.
♻ ☆ EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models
In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard open-source instruction processing implementation framework available for the community, which hinders practitioners from further developing and advancing. To facilitate instruction processing research and development, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. EasyInstruct is publicly released and actively maintained at https://github.com/zjunlp/EasyInstruct, along with an online demo app and a demo video for quick-start, calling for broader research centered on instruction data and synthetic data.
comment: Project website: https://zjunlp.github.io/project/EasyInstruct Code: https://github.com/zjunlp/EasyInstruct Video: https://youtu.be/rfQOWYfziFo Demo: https://huggingface.co/spaces/zjunlp/EasyInstruct
♻ ☆ Exploring Human's Gender Perception and Bias toward Non-Humanoid Robots
As non-humanoid robots increasingly permeate various sectors, understanding their design implications for human acceptance becomes paramount. Despite their ubiquity, studies on how to improve human interaction are sparse. Our investigation, conducted through two surveys, addresses this gap. The first survey emphasizes non-humanoid robots and human perceptions about gender attributions, suggesting that both design and perceived gender influence acceptance. Survey 2 investigates the effects of varying gender cues on robot designs and their consequent impacts on human-robot interactions. Our findings highlighted that distinct gender cues can bolster or impede interaction comfort.
♻ ☆ Large Language Models for Multi-Modal Human-Robot Interaction
This paper presents an innovative large language model (LLM)-based robotic system for enhancing multi-modal human-robot interaction (HRI). Traditional HRI systems relied on complex designs for intent estimation, reasoning, and behavior generation, which were resource-intensive. In contrast, our system empowers researchers and practitioners to regulate robot behavior through three key aspects: providing high-level linguistic guidance, creating "atomics" for actions and expressions the robot can use, and offering a set of examples. Implemented on a physical robot, it demonstrates proficiency in adapting to multi-modal inputs and determining the appropriate manner of action to assist humans with its arms, following researchers' defined guidelines. Simultaneously, it coordinates the robot's lid, neck, and ear movements with speech output to produce dynamic, multi-modal expressions. This showcases the system's potential to revolutionize HRI by shifting from conventional, manual state-and-flow design methods to an intuitive, guidance-based, and example-driven approach.
comment: 10 pages, 6 figures
♻ ☆ Exploring Large Language Models to Facilitate Variable Autonomy for Human-Robot Teaming
In a rapidly evolving digital landscape autonomous tools and robots are becoming commonplace. Recognizing the significance of this development, this paper explores the integration of Large Language Models (LLMs) like Generative pre-trained transformer (GPT) into human-robot teaming environments to facilitate variable autonomy through the means of verbal human-robot communication. In this paper, we introduce a novel framework for such a GPT-powered multi-robot testbed environment, based on a Unity Virtual Reality (VR) setting. This system allows users to interact with robot agents through natural language, each powered by individual GPT cores. By means of OpenAI's function calling, we bridge the gap between unstructured natural language input and structure robot actions. A user study with 12 participants explores the effectiveness of GPT-4 and, more importantly, user strategies when being given the opportunity to converse in natural language within a multi-robot environment. Our findings suggest that users may have preconceived expectations on how to converse with robots and seldom try to explore the actual language and cognitive capabilities of their robot collaborators. Still, those users who did explore where able to benefit from a much more natural flow of communication and human-like back-and-forth. We provide a set of lessons learned for future research and technical implementations of similar systems.
comment: Frontiers in Robotics and AI, Variable Autonomy for Human-Robot Teaming
♻ ☆ Designing Digital Voting Systems for Citizens: Achieving Fairness and Legitimacy in Participatory Budgeting
Participatory Budgeting (PB) has evolved into a key democratic instrument for resource allocation in cities. Enabled by digital platforms, cities now have the opportunity to let citizens directly propose and vote on urban projects, using different voting input and aggregation rules. However, the choices cities make in terms of the rules of their PB have often not been informed by academic studies on voter behaviour and preferences. Therefore, this work presents the results of behavioural experiments where participants were asked to vote in a fictional PB setting. We identified approaches to designing PB voting that minimise cognitive load and enhance the perceived fairness and legitimacy of the digital process from the citizens' perspective. In our study, participants preferred voting input formats that are more expressive (like rankings and distributing points) over simpler formats (like approval voting). Participants also indicated a desire for the budget to be fairly distributed across city districts and project categories. Participants found the Method of Equal Shares voting rule to be fairer than the conventional Greedy voting rule. These findings offer actionable insights for digital governance, contributing to the development of fairer and more transparent digital systems and collective decision-making processes for citizens.
comment: Under review in ACM Digital Government: Research and Practice
♻ ☆ An Empathy-Based Sandbox Approach to Bridge the Privacy Gap among Attitudes, Goals, Knowledge, and Behaviors
Managing privacy to reach privacy goals is challenging, as evidenced by the privacy attitude-behavior gap. Mitigating this discrepancy requires solutions that account for both system opaqueness and users' hesitations in testing different privacy settings due to fears of unintended data exposure. We introduce an empathy-based approach that allows users to experience how privacy attributes may alter system outcomes in a risk-free sandbox environment from the perspective of artificially generated personas. To generate realistic personas, we introduce a novel pipeline that augments the outputs of large language models (e.g., GPT-4) using few-shot learning, contextualization, and chain of thoughts. Our empirical studies demonstrated the adequate quality of generated personas and highlighted the changes in privacy-related applications (e.g., online advertising) caused by different personas. Furthermore, users demonstrated cognitive and emotional empathy towards the personas when interacting with our sandbox. We offered design implications for downstream applications in improving user privacy literacy.
♻ ☆ Tur[k]ingBench: A Challenge Benchmark for Web Agents
Recent chatbots have demonstrated impressive ability to understand and communicate in raw-text form. However, there is more to the world than raw text. For example, humans spend long hours of their time on web pages, where text is intertwined with other modalities and tasks are accomplished in the form of various complex interactions. Can state-of-the-art multi-modal models generalize to such complex domains? To address this question, we introduce TurkingBench, a benchmark of tasks formulated as web pages containing textual instructions with multi-modal context. Unlike existing work which employs artificially synthesized web pages, here we use natural HTML pages that were originally designed for crowdsourcing workers for various annotation purposes. The HTML instructions of each task are also instantiated with various values (obtained from the crowdsourcing tasks) to form new instances of the task. This benchmark contains 32.2K instances distributed across 158 tasks. Additionally, to facilitate the evaluation on TurkingBench, we develop an evaluation framework that connects the responses of chatbots to modifications on web pages (modifying a text box, checking a radio, etc.). We evaluate the performance of state-of-the-art models, including language-only, vision-only, and layout-only models, and their combinations, on this benchmark. Our findings reveal that these models perform significantly better than random chance, yet considerable room exists for improvement. We hope this benchmark will help facilitate the evaluation and development of web-based agents.
♻ ☆ 3DA: Assessing 3D-Printed Electrodes for Measuring Electrodermal Activity
Electrodermal activity (EDA) reflects changes in skin conductance, which are closely tied to human psychophysiological states. For example, EDA sensors can assess stress, cognitive workload, arousal, or other measures tied to the sympathetic nervous system for interactive human-centered applications. Yet, current limitations involve the complex attachment and proper skin contact with EDA sensors. This paper explores the concept of 3D printing electrodes for EDA measurements, integrating sensors into arbitrary 3D-printed objects, alleviating the need for complex assembly and attachment. We examine the adaptation of conventional EDA circuits for 3D-printed electrodes, assessing different electrode shapes and their impact on the sensing accuracy. A user study (N=6) revealed that 3D-printed electrodes can measure EDA with similar accuracy, suggesting larger contact areas for improved precision. We derive design implications to facilitate the integration of EDA sensors into 3D-printed devices to foster diverse integration into everyday objects for prototyping physiological interfaces.
♻ ☆ FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction
Researchers have proposed to use data of human preference feedback to fine-tune text-to-image generative models. However, the scalability of human feedback collection has been limited by its reliance on manual annotation. Therefore, we develop and test a method to automatically annotate user preferences from their spontaneous facial expression reaction to the generated images. We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images. Specifically, AU4 (brow lowerer) is reflective of negative evaluations of the generated image whereas AU12 (lip corner puller) is reflective of positive evaluations. These can be useful in two ways. Firstly, we can automatically annotate user preferences between image pairs with substantial difference in these AU responses with an accuracy significantly outperforming state-of-the-art scoring models. Secondly, directly integrating the AU responses with the scoring models improves their consistency with human preferences. Finally, this method of automatic annotation with facial expression analysis can be potentially generalized to other generation tasks. The code is available at https://github.com/ShuangquanFeng/FERGI, and the dataset is also available at the same link for research purposes.
Networking and Internet Architecture
☆ Extrapolating Solution Paths of Polynomial Homotopies towards Singularities with PHCpack and phcpy
A robust path tracker [Telen, Van Barel, Verschelde, SISC 2020] computes the radius of convergence of Newton's method, estimates the distance to the nearest path, and then applies Pad\'e approximants to predict the next point on the path. Apriori step size control is less sensitive to finely tuned tolerances than aposteriori step size control, and is therefore robust. Extrapolation methods are effective to accurately locate the singular points at the end of solution paths, as illustrated with phcpy, the scripting interface to PHCpack.
♻ ☆ CUQIpy: II. Computational uncertainty quantification for PDE-based inverse problems in Python
Inverse problems, particularly those governed by Partial Differential Equations (PDEs), are prevalent in various scientific and engineering applications, and uncertainty quantification (UQ) of solutions to these problems is essential for informed decision-making. This second part of a two-paper series builds upon the foundation set by the first part, which introduced CUQIpy, a Python software package for computational UQ in inverse problems using a Bayesian framework. In this paper, we extend CUQIpy's capabilities to solve PDE-based Bayesian inverse problems through a general framework that allows the integration of PDEs in CUQIpy, whether expressed natively or using third-party libraries such as FEniCS. CUQIpy offers concise syntax that closely matches mathematical expressions, streamlining the modeling process and enhancing the user experience. The versatility and applicability of CUQIpy to PDE-based Bayesian inverse problems are demonstrated on examples covering parabolic, elliptic and hyperbolic PDEs. This includes problems involving the heat and Poisson equations and application case studies in electrical impedance tomography and photo-acoustic tomography, showcasing the software's efficiency, consistency, and intuitive interface. This comprehensive approach to UQ in PDE-based inverse problems provides accessibility for non-experts and advanced features for experts.
comment: 38 pages, 12 figures
♻ ☆ CUQIpy: I. Computational uncertainty quantification for inverse problems in Python
This paper introduces CUQIpy, a versatile open-source Python package for computational uncertainty quantification (UQ) in inverse problems, presented as Part I of a two-part series. CUQIpy employs a Bayesian framework, integrating prior knowledge with observed data to produce posterior probability distributions that characterize the uncertainty in computed solutions to inverse problems. The package offers a high-level modeling framework with concise syntax, allowing users to easily specify their inverse problems, prior information, and statistical assumptions. CUQIpy supports a range of efficient sampling strategies and is designed to handle large-scale problems. Notably, the automatic sampler selection feature analyzes the problem structure and chooses a suitable sampler without user intervention, streamlining the process. With a selection of probability distributions, test problems, computational methods, and visualization tools, CUQIpy serves as a powerful, flexible, and adaptable tool for UQ in a wide selection of inverse problems. Part II of the series focuses on the use of CUQIpy for UQ in inverse problems with partial differential equations (PDEs).
comment: 37 pages, 11 figures
♻ ☆ A new open source framework for multiscale modeling of fibrous materials on heterogeneous supercomputers
This article presents MuMFiM, an open source application for multiscale modeling of fibrous materials on massively parallel computers. MuMFiM uses two scales to represent fibrous materials such as biological network materials (extracellular matrix, connective tissue, etc.). It is designed to make use of multiple levels of parallelism, including distributed parallelism of the macro and microscales as well as GPU accelerated data-parallelism of the microscale. Scaling results of the GPU accelerated microscale show that solving microscale problems concurrently on the GPU can lead to a 1000x speedup over the solution of a single RVE on the GPU. In addition, we show nearly optimal strong and weak scaling results of MuMFiM on up to 128 nodes of AiMOS (Rensselaer Polytechnic Institute) which is composed of IBM AC922 nodes with 6 Volta V100 GPU and 2 20 core Power 9 CPUs each. We also show how MuMFiM can be used to solve problems of interest to the broader engineering community, in particular providing an example of the facet capsule ligament (FCL) of the human spine undergoing uniaxial extension.
Social and Information Networks
☆ Dynamical importance and network perturbations
The leading eigenvalue $\lambda$ of the adjacency matrix of a graph exerts much influence on the behavior of dynamical processes on that graph. It is thus relevant to relate notions of the importance (specifically, centrality measures) of network structures to $\lambda$ and its associated eigenvector. We study a previously derived measure of edge importance known as "dynamical importance", which estimates how much $\lambda$ changes when one removes an edge from a graph or adds an edge to it. We examine the accuracy of this estimate for different network structures and compare it to the true change in $\lambda$ after an edge removal or edge addition. We then derive a first-order approximation of the change in the leading eigenvector. We also consider the effects of edge additions on Kuramoto dynamics on networks, and we express the Kuramoto order parameter in terms of dynamical importance. Through our analysis and computational experiments, we find that studying dynamical importance can improve understanding of the relationship between network perturbations and dynamical processes on networks.
comment: 12 pages
Language Models Can Reduce Asymmetry in Information Markets
This work addresses the buyer's inspection paradox for information markets. The paradox is that buyers need to access information to determine its value, while sellers need to limit access to prevent theft. To study this, we introduce an open-source simulated digital marketplace where intelligent agents, powered by language models, buy and sell information on behalf of external participants. The central mechanism enabling this marketplace is the agents' dual capabilities: they not only have the capacity to assess the quality of privileged information but also come equipped with the ability to forget. This ability to induce amnesia allows vendors to grant temporary access to proprietary information, significantly reducing the risk of unauthorized retention while enabling agents to accurately gauge the information's relevance to specific queries or tasks. To perform well, agents must make rational decisions, strategically explore the marketplace through generated sub-queries, and synthesize answers from purchased information. Concretely, our experiments (a) uncover biases in language models leading to irrational behavior and evaluate techniques to mitigate these biases, (b) investigate how price affects demand in the context of informational goods, and (c) show that inspection and higher budgets both lead to higher quality outcomes.
☆ Random Graph Modeling: A survey of the concepts
Random graph (RG) models play a central role in the complex networks analysis. They help to understand, control, and predict phenomena occurring, for instance, in social networks, biological networks, the Internet, etc. Despite a large number of RG models presented in the literature, there are few concepts underlying them. Instead of trying to classify a wide variety of very dispersed models, we capture and describe concepts they exploit considering preferential attachment, copying principle, hyperbolic geometry, recursively defined structure, edge switching, Monte Carlo sampling, etc. We analyze RG models, extract their basic principles, and build a taxonomy of concepts they are based on. We also discuss how these concepts are combined in RG models and how they work in typical applications like benchmarks, null models, and data anonymization.
☆ Collecting Influencers: A Comparative Study of Online Network Crawlers
Online network crawling tasks require a lot of efforts for the researchers to collect the data. One of them is identification of important nodes, which has many applications starting from viral marketing to the prevention of disease spread. Various crawling algorithms has been suggested but their efficiency is not studied well. In this paper we compared six known crawlers on the task of collecting the fraction of the most influential nodes of graph. We analyzed crawlers behavior for four measures of node influence: node degree, k-coreness, betweenness centrality, and eccentricity. The experiments confirmed that greedy methods perform the best in many settings, but the cases exist when they are very inefficient.
☆ From Perils to Possibilities: Understanding how Human (and AI) Biases affect Online Fora
Social media platforms are online fora where users engage in discussions, share content, and build connections. This review explores the dynamics of social interactions, user-generated contents, and biases within the context of social media analysis (analyzing works that use the tools offered by complex network analysis and natural language processing) through the lens of three key points of view: online debates, online support, and human-AI interactions. On the one hand, we delineate the phenomenon of online debates, where polarization, misinformation, and echo chamber formation often proliferate, driven by algorithmic biases and extreme mechanisms of homophily. On the other hand, we explore the emergence of online support groups through users' self-disclosure and social support mechanisms. Online debates and support mechanisms present a duality of both perils and possibilities within social media; perils of segregated communities and polarized debates, and possibilities of empathy narratives and self-help groups. This dichotomy also extends to a third perspective: users' reliance on AI-generated content, such as the ones produced by Large Language Models, which can manifest both human biases hidden in training sets and non-human biases that emerge from their artificial neural architectures. Analyzing interdisciplinary approaches, we aim to deepen the understanding of the complex interplay between social interactions, user-generated content, and biases within the realm of social media ecosystems.
☆ Relaxed Clique Percolation and Disinformation-Resilient Domains for Social Commerce Networks
Must we trace and block all fake content in a social commerce network so that genuine users may enjoy fake-free information? Such efforts largely fail, because, as we get better at spam detection, spammers use the same advances for anti-detection. As a fundamentally new approach, we show that an online platform can aggregate and route user-generated content in a smart personalized way, which fosters and relies on "collective social responsibility". We introduce the notion of information aggregation domain, or simply, domain: composed for a given "central" node (user account), a domain is a connected set of nodes whose user-generated content is eligible to be used to meet the central node's information needs. Admitting malicious information sources - "bad citizen" nodes - into "good citizen" nodes' domains puts the good citizens at risk for disinformation attacks. We show how a platform can limit this risk by exploiting the social link structure between its nodes without the need to know which nodes are good or bad citizens. We introduce Relaxed Clique Percolation (RCP), a class of policies to compose personalized disinformation-resilient domains. Then, we define "RCP cores" and show how they can be used to efficiently compose resilient domains for all network nodes at once. Finally, we analyze the properties of RCP domains found in real-world social networks including Slashdot, Facebook, Flickr, and Yelp, to affirm that in practice, RCP domains turn out to be large and spatially diverse.
comment: 18 pages, 7 figures
♻ ☆ Can Generative AI Generate Breakthrough Ideas?
The surge in generative AI capabilities has affected sectors such as drug discovery and creative text generation, fueling widespread enthusiasm about its potential to revolutionize scientific discovery through efficient exploration of knowledge combinations. But is this belief well-founded? This belief is rooted in the recombinant growth theory, which posits that innovation accelerates when existing ideas are iteratively combined. However, the theory encounters two significant challenges in understanding the nature of breakthroughs. First, breakthroughs such as relativity replacing Newtonian physics drive progress through competition, because they are fundamentally substitutive of older ones. Second, the recombinant strategy often only generates different ideas rather than better ones. Building on these, our study indicates the limitation of combinatorial view of innovation and point to the role of idea competition rather than combination in advancing science, even in the age of AI. Our results suggest that breakthroughs occur when ideas compete, not when they combine, and that combining more ideas tends to result in smaller innovations. This challenges the combinatoric metaphor of innovation that has captivated academia for three decades and complements subsequent studies equating content novelty with transformative innovation. Policymakers and researchers should focus on fostering environments that encourage idea competition and the development of AI systems capable of generating novel, disruptive ideas.
comment: 2 figures
♻ ☆ Incentivizing News Consumption on Social Media Platforms Using Large Language Models and Realistic Bot Accounts
Polarization, declining trust, and wavering support for democratic norms are pressing threats to U.S. democracy. Exposure to verified and quality news may lower individual susceptibility to these threats and make citizens more resilient to misinformation, populism, and hyperpartisan rhetoric. This project examines how to enhance users' exposure to and engagement with verified and ideologically balanced news in an ecologically valid setting. We rely on a large-scale two-week long field experiment (from 1/19/2023 to 2/3/2023) on 28,457 Twitter users. We created 28 bots utilizing GPT-2 that replied to users tweeting about sports, entertainment, or lifestyle with a contextual reply containing two hardcoded elements: a URL to the topic-relevant section of quality news organization and an encouragement to follow its Twitter account. To further test differential effects by gender of the bots, treated users were randomly assigned to receive responses by bots presented as female or male. We examine whether our over-time intervention enhances the following of news media organization, the sharing and the liking of news content and the tweeting about politics and the liking of political content. We find that the treated users followed more news accounts and the users in the female bot treatment were more likely to like news content than the control. Most of these results, however, were small in magnitude and confined to the already politically interested Twitter users, as indicated by their pre-treatment tweeting about politics. These findings have implications for social media and news organizations, and also offer direction for future work on how Large Language Models and other computational interventions can effectively enhance individual on-platform engagement with quality news and public affairs.
Distributed, Parallel, and Cluster Computing
☆ History-Independent Concurrent Objects
A data structure is called history independent if its internal memory representation does not reveal the history of operations applied to it, only its current state. In this paper we study history independence for concurrent data structures, and establish foundational possibility and impossibility results. We show that a large class of concurrent objects cannot be implemented from smaller base objects in a manner that is both wait-free and history independent; but if we settle for either lock-freedom instead of wait-freedom or for a weak notion of history independence, then at least one object in the class, multi-valued single-reader single-writer registers, can be implemented from smaller base objects, binary registers. On the other hand, using large base objects, we give a strong possibility result in the form of a universal construction: an object with $s$ possible states can be implemented in a wait-free, history-independent manner from compare-and-swap base objects that each have $O(s + 2^n)$ possible memory states, where $n$ is the number of processes in the system.
☆ Accelerating Time-to-Science by Streaming Detector Data Directly into Perlmutter Compute Nodes
Recent advancements in detector technology have significantly increased the size and complexity of experimental data, and high-performance computing (HPC) provides a path towards more efficient and timely data processing. However, movement of large data sets from acquisition systems to HPC centers introduces bottlenecks owing to storage I/O at both ends. This manuscript introduces a streaming workflow designed for an high data rate electron detector that streams data directly to compute node memory at the National Energy Research Scientific Computing Center (NERSC), thereby avoiding storage I/O. The new workflow deploys ZeroMQ-based services for data production, aggregation, and distribution for on-the-fly processing, all coordinated through a distributed key-value store. The system is integrated with the detector's science gateway and utilizes the NERSC Superfacility API to initiate streaming jobs through a web-based frontend. Our approach achieves up to a 14-fold increase in data throughput and enhances predictability and reliability compared to a I/O-heavy file-based transfer workflow. Our work highlights the transformative potential of streaming workflows to expedite data analysis for time-sensitive experiments.
☆ Adversary-Augmented Simulation to evaluate client-fairness on HyperLedger Fabric
This paper presents a novel adversary model specifically tailored to distributed systems, with the aim to asses the security of blockchain technologies. Building upon literature on adversarial assumptions and capabilities, we include classical notions of failure and communication models to classify and bind the use of adversarial actions. We focus on the effect of these actions on properties of distributed protocols. A significant effort of our research is the integration of this model into the Multi-Agent eXperimenter (MAX) framework. This integration enables realistic simulations of adversarial attacks on blockchain systems. In particular, we have simulated attacks violating a form of client-fairness on HyperLedger Fabric.
comment: 10 pages (2 pages of references), 8 figures
☆ On the Power of Quantum Distributed Proofs
Quantum nondeterministic distributed computing was recently introduced as dQMA (distributed quantum Merlin-Arthur) protocols by Fraigniaud, Le Gall, Nishimura and Paz (ITCS 2021). In dQMA protocols, with the help of quantum proofs and local communication, nodes on a network verify a global property of the network. Fraigniaud et al. showed that, when the network size is small, there exists an exponential separation in proof size between distributed classical and quantum verification protocols, for the equality problem, where the verifiers check if all the data owned by a subset of them are identical. In this paper, we further investigate and characterize the power of the dQMA protocols for various decision problems. First, we give a more efficient dQMA protocol for the equality problem with a simpler analysis. This is done by adding a symmetrization step on each node and exploiting properties of the permutation test, which is a generalization of the SWAP test. We also show a quantum advantage for the equality problem on path networks still persists even when the network size is large, by considering ``relay points'' between extreme nodes. Second, we show that even in a general network, there exist efficient dQMA protocols for the ranking verification problem, the Hamming distance problem, and more problems that derive from efficient quantum one-way communication protocols. Third, in a line network, we construct an efficient dQMA protocol for a problem that has an efficient two-party QMA communication protocol. Finally, we obtain the first lower bounds on the proof and communication cost of dQMA protocols. To prove a lower bound on the equality problem, we show any dQMA protocol with an entangled proof between nodes can be simulated with a dQMA protocol with a separable proof between nodes by using a QMA communication-complete problem introduced by Raz and Shpilka (CCC 2004).
comment: 49 pages
☆ Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances NSDI '24
Deep neural networks (DNNs) are becoming progressively large and costly to train. This paper aims to reduce DNN training costs by leveraging preemptible instances on modern clouds, which can be allocated at a much lower price when idle but may be preempted by the cloud provider at any time. Prior work that supports DNN training on preemptive instances employs a reactive approach to handling instance preemptions and allocations after their occurrence, which only achieves limited performance and scalability. We present Parcae, a system that enables cheap, fast, and scalable DNN training on preemptible instances by proactively adjusting the parallelization strategy of a DNN training job to adapt to predicted resource changes before instance preemptions and allocations really happen, which significantly reduces the cost of handling these events. Parcae optimizes liveput, a novel metric that measures the expected training throughput of a DNN job under various possible preemption scenarios. Compared to existing reactive, throughput-optimized systems, Parcae's proactive, live-optimized solution considers both the throughput of a job and its robustness under preemptions. To optimize liveput, Parcae supports lightweight instance migration and uses an availability predictor to forecast future preemptions. It then uses a liveput optimizer to discover an optimal strategy to parallelize DNN training under predicted preemptions. We evaluate Parcae on a variety of DNNs and preemption traces and show that Parcae outperforms existing spot-instance DNN training systems by up to 10$\times$. More importantly, Parcae achieves near-optimal performance for training large DNNs under frequent preemptions, in which case existing approaches cannot make any progress.
comment: NSDI '24
☆ Accelerating ViT Inference on FPGA through Static and Dynamic Pruning
Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being applied to many real-world applications. Weight and token pruning are two well-known methods for reducing complexity: weight pruning reduces the model size and associated computational demands, while token pruning further dynamically reduces the computation based on the input. Combining these two techniques should significantly reduce computation complexity and model size; however, naively integrating them results in irregular computation patterns, leading to significant accuracy drops and difficulties in hardware acceleration. Addressing the above challenges, we propose a comprehensive algorithm-hardware codesign for accelerating ViT on FPGA through simultaneous pruning -combining static weight pruning and dynamic token pruning. For algorithm design, we systematically combine a hardware-aware structured block-pruning method for pruning model parameters and a dynamic token pruning method for removing unimportant token vectors. Moreover, we design a novel training algorithm to recover the model's accuracy. For hardware design, we develop a novel hardware accelerator for executing the pruned model. The proposed hardware design employs multi-level parallelism with load balancing strategy to efficiently deal with the irregular computation pattern led by the two pruning approaches. Moreover, we develop an efficient hardware mechanism for efficiently executing the on-the-fly token pruning.
comment: FCCM 2024
☆ Loop Improvement: An Efficient Approach for Extracting Shared Features from Heterogeneous Data without Central Server
In federated learning, data heterogeneity significantly impacts performance. A typical solution involves segregating these parameters into shared and personalized components, a concept also relevant in multi-task learning. Addressing this, we propose "Loop Improvement" (LI), a novel method enhancing this separation and feature extraction without necessitating a central server or data interchange among participants. Our experiments reveal LI's superiority in several aspects: In personalized federated learning environments, LI consistently outperforms the advanced FedALA algorithm in accuracy across diverse scenarios. Additionally, LI's feature extractor closely matches the performance achieved when aggregating data from all clients. In global model contexts, employing LI with stacked personalized layers and an additional network also yields comparable results to combined client data scenarios. Furthermore, LI's adaptability extends to multi-task learning, streamlining the extraction of common features across tasks and obviating the need for simultaneous training. This approach not only enhances individual task performance but also achieves accuracy levels on par with classic multi-task learning methods where all tasks are trained simultaneously. LI integrates a loop topology with layer-wise and end-to-end training, compatible with various neural network models. This paper also delves into the theoretical underpinnings of LI's effectiveness, offering insights into its potential applications. The code is on https://github.com/axedge1983/LI
comment: 11 pages, 11 figures
☆ AI and Memory Wall
The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/2yrs, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
comment: Published in IEEE Micro Journal
☆ Blockchain e Sistemas Distribuídos: conceitos básicos e implicações
Blockchain technology has emerged as a necessity for the decentralization of payment methods and transactions, but it has brought with it many properties of distributed systems that have made it a crucial technology for overcoming some of society's challenges, especially in the context of decentralizing services, transparency of information, availability, and security. Its architecture and communication methods, although possessing some complex nuances to understand, particularly for the lay audience in the field of distributed systems, protocols, and computer networks. In this article, we will explore some topics of distributed systems related to blockchain technology.
comment: in Portuguese language
☆ iSpLib: A Library for Accelerating Graph Neural Networks using Auto-tuned Sparse Operations
Core computations in Graph Neural Network (GNN) training and inference are often mapped to sparse matrix operations such as sparse-dense matrix multiplication (SpMM). These sparse operations are harder to optimize by manual tuning because their performance depends significantly on the sparsity of input graphs, GNN models, and computing platforms. To address this challenge, we present iSpLib, a PyTorch-based C++ library equipped with auto-tuned sparse operations. iSpLib expedites GNN training with a cache-enabled backpropagation that stores intermediate matrices in local caches. The library offers a user-friendly Python plug-in that allows users to take advantage of our optimized PyTorch operations out-of-the-box for any existing linear algebra-based PyTorch implementation of popular GNNs (Graph Convolution Network, GraphSAGE, Graph Inference Network, etc.) with only two lines of additional code. We demonstrate that iSpLib obtains up to 27x overall training speedup compared to the equivalent PyTorch 2.1.0 and PyTorch Geometric 2.4.0 implementations on the CPU. Our library is publicly available at https://github.com/HipGraph/iSpLib (https://doi.org/10.5281/zenodo.10806511).
☆ CASPER: Carbon-Aware Scheduling and Provisioning for Distributed Web Services
There has been a significant societal push towards sustainable practices, including in computing. Modern interactive workloads such as geo-distributed web-services exhibit various spatiotemporal and performance flexibility, enabling the possibility to adapt the location, time, and intensity of processing to align with the availability of renewable and low-carbon energy. An example is a web application hosted across multiple cloud regions, each with varying carbon intensity based on their local electricity mix. Distributed load-balancing enables the exploitation of low-carbon energy through load migration across regions, reducing web applications carbon footprint. In this paper, we present CASPER, a carbon-aware scheduling and provisioning system that primarily minimizes the carbon footprint of distributed web services while also respecting their Service Level Objectives (SLO). We formulate CASPER as an multi-objective optimization problem that considers both the variable carbon intensity and latency constraints of the network. Our evaluation reveals the significant potential of CASPER in achieving substantial reductions in carbon emissions. Compared to baseline methods, CASPER demonstrates improvements of up to 70% with no latency performance degradation.
☆ FedMef: Towards Memory-efficient Federated Dynamic Pruning CVPR2024
Federated learning (FL) promotes decentralized training while prioritizing data confidentiality. However, its application on resource-constrained devices is challenging due to the high demand for computation and memory resources to train deep learning models. Neural network pruning techniques, such as dynamic pruning, could enhance model efficiency, but directly adopting them in FL still poses substantial challenges, including post-pruning performance degradation, high activation memory usage, etc. To address these challenges, we propose FedMef, a novel and memory-efficient federated dynamic pruning framework. FedMef comprises two key components. First, we introduce the budget-aware extrusion that maintains pruning efficiency while preserving post-pruning performance by salvaging crucial information from parameters marked for pruning within a given budget. Second, we propose scaled activation pruning to effectively reduce activation memory footprints, which is particularly beneficial for deploying FL to memory-limited devices. Extensive experiments demonstrate the effectiveness of our proposed FedMef. In particular, it achieves a significant reduction of 28.5% in memory footprint compared to state-of-the-art methods while obtaining superior accuracy.
comment: Accepted by CVPR2024
♻ ☆ Task Graph offloading via Deep Reinforcement Learning in Mobile Edge Computing
Various mobile applications that comprise dependent tasks are gaining widespread popularity and are increasingly complex. These applications often have low-latency requirements, resulting in a significant surge in demand for computing resources. With the emergence of mobile edge computing (MEC), it becomes the most significant issue to offload the application tasks onto small-scale devices deployed at the edge of the mobile network for obtaining a high-quality user experience. However, since the environment of MEC is dynamic, most existing works focusing on task graph offloading, which rely heavily on expert knowledge or accurate analytical models, fail to fully adapt to such environmental changes, resulting in the reduction of user experience. This paper investigates the task graph offloading in MEC, considering the time-varying computation capabilities of edge computing devices. To adapt to environmental changes, we model the task graph scheduling for computation offloading as a Markov Decision Process (MDP). Then, we design a deep reinforcement learning algorithm (SATA-DRL) to learn the task scheduling strategy from the interaction with the environment, to improve user experience. Extensive simulations validate that SATA-DRL is superior to existing strategies in terms of reducing average makespan and deadline violation.
comment: 13 figures
♻ ☆ A new open source framework for multiscale modeling of fibrous materials on heterogeneous supercomputers
This article presents MuMFiM, an open source application for multiscale modeling of fibrous materials on massively parallel computers. MuMFiM uses two scales to represent fibrous materials such as biological network materials (extracellular matrix, connective tissue, etc.). It is designed to make use of multiple levels of parallelism, including distributed parallelism of the macro and microscales as well as GPU accelerated data-parallelism of the microscale. Scaling results of the GPU accelerated microscale show that solving microscale problems concurrently on the GPU can lead to a 1000x speedup over the solution of a single RVE on the GPU. In addition, we show nearly optimal strong and weak scaling results of MuMFiM on up to 128 nodes of AiMOS (Rensselaer Polytechnic Institute) which is composed of IBM AC922 nodes with 6 Volta V100 GPU and 2 20 core Power 9 CPUs each. We also show how MuMFiM can be used to solve problems of interest to the broader engineering community, in particular providing an example of the facet capsule ligament (FCL) of the human spine undergoing uniaxial extension.
Machine Learning
☆ MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io
comment: 46 Pages, Work in Progress, Benchmark Project Page: https://mathverse-cuhk.github.io
☆ Simplified Diffusion Schrödinger Bridge
This paper introduces a novel theoretical simplification of the Diffusion Schr\"odinger Bridge (DSB) that facilitates its unification with Score-based Generative Models (SGMs), addressing the limitations of DSB in complex data generation and enabling faster convergence and enhanced performance. By employing SGMs as an initial solution for DSB, our approach capitalizes on the strengths of both frameworks, ensuring a more efficient training process and improving the performance of SGM. We also propose a reparameterization technique that, despite theoretical approximations, practically improves the network's fitting capabilities. Our extensive experimental evaluations confirm the effectiveness of the simplified DSB, demonstrating its significant improvements. We believe the contributions of this work pave the way for advanced generative modeling. The code is available at https://github.com/tzco/Simplified-Diffusion-Schrodinger-Bridge.
☆ Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc. with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics.
☆ DreamReward: Text-to-3D Generation with Human Preference
3D content creation from text prompts has shown remarkable success recently. However, current text-to-3D methods often generate 3D results that do not align well with human preferences. In this paper, we present a comprehensive framework, coined DreamReward, to learn and improve text-to-3D models from human preference feedback. To begin with, we collect 25k expert comparisons based on a systematic annotation pipeline including rating and ranking. Then, we build Reward3D -- the first general-purpose text-to-3D human preference reward model to effectively encode human preferences. Building upon the 3D reward model, we finally perform theoretical analysis and present the Reward3D Feedback Learning (DreamFL), a direct tuning algorithm to optimize the multi-view diffusion models with a redefined scorer. Grounded by theoretical proof and extensive experiment comparisons, our DreamReward successfully generates high-fidelity and 3D consistent results with significant boosts in prompt alignment with human intention. Our results demonstrate the great potential for learning from human feedback to improve text-to-3D models.
comment: Project page: https://jamesyjl.github.io/DreamReward
☆ Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. Especially, the expansive scale and computational demands pose considerable challenges when customizing them for particular downstream tasks, particularly over the hardware platforms constrained by computational capabilities. Parameter Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapt the large models over the various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large models to adapt it to a specific task while minimizing the number of additional parameters introduced or computational resources required. This approach is particularly important when dealing with large language models with high parameter counts, as fine-tuning these models from scratch can be computationally expensive and resource-intensive, posing considerable challenges in the supporting system platform design. In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate computation costs for PEFT. In addition to the algorithmic perspective, we overview various real-world system designs to investigate the implementation costs associated with different PEFT algorithms. This survey serves as an indispensable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed insights into recent advancements and practical applications.
comment: 25 pages, 13 figures
☆ The Elements of Differentiable Programming
Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of differentiable programming. This new programming paradigm enables end-to-end differentiation of complex computer programs (including those with control flows and data structures), making gradient-based optimization of program parameters possible. As an emerging paradigm, differentiable programming builds upon several areas of computer science and applied mathematics, including automatic differentiation, graphical models, optimization and statistics. This book presents a comprehensive review of the fundamental concepts useful for differentiable programming. We adopt two main perspectives, that of optimization and that of probability, with clear analogies between the two. Differentiable programming is not merely the differentiation of programs, but also the thoughtful design of programs intended for differentiation. By making programs differentiable, we inherently introduce probability distributions over their execution, providing a means to quantify the uncertainty associated with program outputs.
comment: Draft version 1
☆ ReNoise: Real Image Inversion Through Iterative Noising
Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities. However, applying these methods to real images necessitates the inversion of the images into the domain of the pretrained diffusion model. Achieving faithful inversion remains a challenge, particularly for more recent models trained to generate images with a small number of denoising steps. In this work, we introduce an inversion method with a high quality-to-operation ratio, enhancing reconstruction accuracy without increasing the number of operations. Building on reversing the diffusion sampling process, our method employs an iterative renoising mechanism at each inversion sampling step. This mechanism refines the approximation of a predicted point along the forward diffusion trajectory, by iteratively applying the pretrained diffusion model, and averaging these predictions. We evaluate the performance of our ReNoise technique using various sampling algorithms and models, including recent accelerated diffusion models. Through comprehensive evaluations and comparisons, we show its effectiveness in terms of both accuracy and speed. Furthermore, we confirm that our method preserves editability by demonstrating text-driven image editing on real images.
comment: project page at: https://garibida.github.io/ReNoise-Inversion/
☆ Extended Reality for Enhanced Human-Robot Collaboration: a Human-in-the-Loop Approach
The rise of automation has provided an opportunity to achieve higher efficiency in manufacturing processes, yet it often compromises the flexibility required to promptly respond to evolving market needs and meet the demand for customization. Human-robot collaboration attempts to tackle these challenges by combining the strength and precision of machines with human ingenuity and perceptual understanding. In this paper, we conceptualize and propose an implementation framework for an autonomous, machine learning-based manipulator that incorporates human-in-the-loop principles and leverages Extended Reality (XR) to facilitate intuitive communication and programming between humans and robots. Furthermore, the conceptual framework foresees human involvement directly in the robot learning process, resulting in higher adaptability and task generalization. The paper highlights key technologies enabling the proposed framework, emphasizing the importance of developing the digital ecosystem as a whole. Additionally, we review the existent implementation approaches of XR in human-robot collaboration, showcasing diverse perspectives and methodologies. The challenges and future outlooks are discussed, delving into the major obstacles and potential research avenues of XR for more natural human-robot interaction and integration in the industrial landscape.
☆ Rethinking Adversarial Inverse Reinforcement Learning: From the Angles of Policy Imitation and Transferable Reward Recovery
Adversarial inverse reinforcement learning (AIRL) stands as a cornerstone approach in imitation learning. This paper rethinks the two different angles of AIRL: policy imitation and transferable reward recovery. We begin with substituting the built-in algorithm in AIRL with soft actor-critic (SAC) during the policy optimization process to enhance sample efficiency, thanks to the off-policy formulation of SAC and identifiable Markov decision process (MDP) models with respect to AIRL. It indeed exhibits a significant improvement in policy imitation but accidentally brings drawbacks to transferable reward recovery. To learn this issue, we illustrate that the SAC algorithm itself is not feasible to disentangle the reward function comprehensively during the AIRL training process, and propose a hybrid framework, PPO-AIRL + SAC, for satisfactory transfer effect. Additionally, we analyze the capability of environments to extract disentangled rewards from an algebraic theory perspective.
☆ ReAct Meets ActRe: Autonomous Annotations of Agent Trajectories for Contrastive Self-Training
Language agents have demonstrated autonomous decision-making abilities by reasoning with foundation models. Recently, efforts have been made to train language agents for performance improvement, with multi-step reasoning and action trajectories as the training data. However, collecting such trajectories still requires considerable human effort, by either artificial annotations or implementations of diverse prompting frameworks. In this work, we propose A$^3$T, a framework that enables the Autonomous Annotation of Agent Trajectories in the style of ReAct. The central role is an ActRe prompting agent, which explains the reason for an arbitrary action. When randomly sampling an external action, the ReAct-style agent could query the ActRe agent with the action to obtain its textual rationales. Novel trajectories are then synthesized by prepending the posterior reasoning from ActRe to the sampled action. In this way, the ReAct-style agent executes multiple trajectories for the failed tasks, and selects the successful ones to supplement its failed trajectory for contrastive self-training. Realized by policy gradient methods with binarized rewards, the contrastive self-training with accumulated trajectories facilitates a closed loop for multiple rounds of language agent self-improvement. We conduct experiments using QLoRA fine-tuning with the open-sourced Mistral-7B-Instruct-v0.2. In AlfWorld, the agent trained with A$^3$T obtains a 1-shot success rate of 96%, and 100% success with 4 iterative rounds. In WebShop, the 1-shot performance of the A$^3$T agent matches human average, and 4 rounds of iterative refinement lead to the performance approaching human experts. A$^3$T agents significantly outperform existing techniques, including prompting with GPT-4, advanced agent frameworks, and fully fine-tuned LLMs.
☆ An Analysis of Linear Time Series Forecasting Models
Despite their simplicity, linear models perform well at time series forecasting, even when pitted against deeper and more expensive models. A number of variations to the linear model have been proposed, often including some form of feature normalisation that improves model generalisation. In this paper we analyse the sets of functions expressible using these linear model architectures. In so doing we show that several popular variants of linear models for time series forecasting are equivalent and functionally indistinguishable from standard, unconstrained linear regression. We characterise the model classes for each linear variant. We demonstrate that each model can be reinterpreted as unconstrained linear regression over a suitably augmented feature set, and therefore admit closed-form solutions when using a mean-squared loss function. We provide experimental evidence that the models under inspection learn nearly identical solutions, and finally demonstrate that the simpler closed form solutions are superior forecasters across 72% of test settings.
☆ Co-Optimization of Environment and Policies for Decentralized Multi-Agent Navigation
This work views the multi-agent system and its surrounding environment as a co-evolving system, where the behavior of one affects the other. The goal is to take both agent actions and environment configurations as decision variables, and optimize these two components in a coordinated manner to improve some measure of interest. Towards this end, we consider the problem of decentralized multi-agent navigation in cluttered environments. By introducing two sub-objectives of multi-agent navigation and environment optimization, we propose an $\textit{agent-environment co-optimization}$ problem and develop a $\textit{coordinated algorithm}$ that alternates between these sub-objectives to search for an optimal synthesis of agent actions and obstacle configurations in the environment; ultimately, improving the navigation performance. Due to the challenge of explicitly modeling the relation between agents, environment and performance, we leverage policy gradient to formulate a model-free learning mechanism within the coordinated framework. A formal convergence analysis shows that our coordinated algorithm tracks the local minimum trajectory of an associated time-varying non-convex optimization problem. Extensive numerical results corroborate theoretical findings and show the benefits of co-optimization over baselines. Interestingly, the results also indicate that optimized environment configurations are able to offer structural guidance that is key to de-conflicting agents in motion.
☆ RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain ICLR 2024
Large Language Models (LLMs) increasingly support applications in a wide range of domains, some with potential high societal impact such as biomedicine, yet their reliability in realistic use cases is under-researched. In this work we introduce the Reliability AssesMent for Biomedical LLM Assistants (RAmBLA) framework and evaluate whether four state-of-the-art foundation LLMs can serve as reliable assistants in the biomedical domain. We identify prompt robustness, high recall, and a lack of hallucinations as necessary criteria for this use case. We design shortform tasks and tasks requiring LLM freeform responses mimicking real-world user interactions. We evaluate LLM performance using semantic similarity with a ground truth response, through an evaluator LLM.
comment: Published at ICLR 2024 Workshop on Reliable and Responsible Foundation Models
☆ A survey on Concept-based Approaches For Model Improvement
The focus of recent research has shifted from merely increasing the Deep Neural Networks (DNNs) performance in various tasks to DNNs, which are more interpretable to humans. The field of eXplainable Artificial Intelligence (XAI) has observed various techniques, including saliency-based and concept-based approaches. Concept-based approaches explain the model's decisions in simple human understandable terms called Concepts. Concepts are human interpretable units of data and are the thinking ground of humans. Explanations in terms of concepts enable detecting spurious correlations, inherent biases, or clever-hans. With the advent of concept-based explanations, there have been various concept representation methods and automatic concept discovery algorithms. Some recent methods use concepts for post-hoc model disentanglement evaluation, while others use them for ante-hoc training. The concept-based approaches are new, with many representations coming up, and there is very limited work on Concept-based Model improvement. We provide a systematic review and taxonomy of various concept representations and their discovery algorithms in DNNs, specifically in vision. We also provide details on concept-based model improvement literature, which is the first to survey concept-based model improvement methods.
☆ Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
☆ Estimating Physical Information Consistency of Channel Data Augmentation for Remote Sensing Images
The application of data augmentation for deep learning (DL) methods plays an important role in achieving state-of-the-art results in supervised, semi-supervised, and self-supervised image classification. In particular, channel transformations (e.g., solarize, grayscale, brightness adjustments) are integrated into data augmentation pipelines for remote sensing (RS) image classification tasks. However, contradicting beliefs exist about their proper applications to RS images. A common point of critique is that the application of channel augmentation techniques may lead to physically inconsistent spectral data (i.e., pixel signatures). To shed light on the open debate, we propose an approach to estimate whether a channel augmentation technique affects the physical information of RS images. To this end, the proposed approach estimates a score that measures the alignment of a pixel signature within a time series that can be naturally subject to deviations caused by factors such as acquisition conditions or phenological states of vegetation. We compare the scores associated with original and augmented pixel signatures to evaluate the physical consistency. Experimental results on a multi-label image classification task show that channel augmentations yielding a score that exceeds the expected deviation of original pixel signatures can not improve the performance of a baseline model trained without augmentation.
comment: Accepted at the IEEE International Geoscience and Remote Sensing Symposium
☆ Object-Centric Domain Randomization for 3D Shape Reconstruction in the Wild
One of the biggest challenges in single-view 3D shape reconstruction in the wild is the scarcity of <3D shape, 2D image>-paired data from real-world environments. Inspired by remarkable achievements via domain randomization, we propose ObjectDR which synthesizes such paired data via a random simulation of visual variations in object appearances and backgrounds. Our data synthesis framework exploits a conditional generative model (e.g., ControlNet) to generate images conforming to spatial conditions such as 2.5D sketches, which are obtainable through a rendering process of 3D shapes from object collections (e.g., Objaverse-XL). To simulate diverse variations while preserving object silhouettes embedded in spatial conditions, we also introduce a disentangled framework which leverages an initial object guidance. After synthesizing a wide range of data, we pre-train a model on them so that it learns to capture a domain-invariant geometry prior which is consistent across various domains. We validate its effectiveness by substantially improving 3D shape reconstruction models on a real-world benchmark. In a scale-up evaluation, our pre-training achieves 23.6% superior results compared with the pre-training on high-quality computer graphics renderings.
comment: Project Page: https://ObjectDR.github.io
☆ Machine-learning invariant foliations in forced systems for reduced order modelling
We identify reduced order models (ROM) of forced systems from data using invariant foliations. The forcing can be external, parametric, periodic or quasi-periodic. The process has four steps: 1. identify an approximate invariant torus and the linear dynamics about the torus; 2. identify a globally defined invariant foliation about the torus; 3. identify a local foliation about an invariant manifold that complements the global foliation 4. extract the invariant manifold as the leaf going through the torus and interpret the result. We combine steps 2 and 3, so that we can track the location of the invariant torus and scale the invariance equations appropriately. We highlight some fundamental limitations of invariant manifolds and foliations when fitting them to data, that require further mathematics to resolve.
☆ Constrained Reinforcement Learning with Smoothed Log Barrier Function
Reinforcement Learning (RL) has been widely applied to many control tasks and substantially improved the performances compared to conventional control methods in many domains where the reward function is well defined. However, for many real-world problems, it is often more convenient to formulate optimization problems in terms of rewards and constraints simultaneously. Optimizing such constrained problems via reward shaping can be difficult as it requires tedious manual tuning of reward functions with several interacting terms. Recent formulations which include constraints mostly require a pre-training phase, which often needs human expertise to collect data or assumes having a sub-optimal policy readily available. We propose a new constrained RL method called CSAC-LB (Constrained Soft Actor-Critic with Log Barrier Function), which achieves competitive performance without any pre-training by applying a linear smoothed log barrier function to an additional safety critic. It implements an adaptive penalty for policy learning and alleviates the numerical issues that are known to complicate the application of the log barrier function method. As a result, we show that with CSAC-LB, we achieve state-of-the-art performance on several constrained control tasks with different levels of difficulty and evaluate our methods in a locomotion task on a real quadruped robot platform.
☆ Soft Learning Probabilistic Circuits
Probabilistic Circuits (PCs) are prominent tractable probabilistic models, allowing for a range of exact inferences. This paper focuses on the main algorithm for training PCs, LearnSPN, a gold standard due to its efficiency, performance, and ease of use, in particular for tabular data. We show that LearnSPN is a greedy likelihood maximizer under mild assumptions. While inferences in PCs may use the entire circuit structure for processing queries, LearnSPN applies a hard method for learning them, propagating at each sum node a data point through one and only one of the children/edges as in a hard clustering process. We propose a new learning procedure named SoftLearn, that induces a PC using a soft clustering process. We investigate the effect of this learning-inference compatibility in PCs. Our experiments show that SoftLearn outperforms LearnSPN in many situations, yielding better likelihoods and arguably better samples. We also analyze comparable tractable models to highlight the differences between soft/hard learning and model querying.
☆ Physics-Based Causal Reasoning for Safe & Robust Next-Best Action Selection in Robot Manipulation Tasks IROS
Safe and efficient object manipulation is a key enabler of many real-world robot applications. However, this is challenging because robot operation must be robust to a range of sensor and actuator uncertainties. In this paper, we present a physics-informed causal-inference-based framework for a robot to probabilistically reason about candidate actions in a block stacking task in a partially observable setting. We integrate a physics-based simulation of the rigid-body system dynamics with a causal Bayesian network (CBN) formulation to define a causal generative probabilistic model of the robot decision-making process. Using simulation-based Monte Carlo experiments, we demonstrate our framework's ability to successfully: (1) predict block tower stability with high accuracy (Pred Acc: 88.6%); and, (2) select an approximate next-best action for the block stacking task, for execution by an integrated robot system, achieving 94.2% task success rate. We also demonstrate our framework's suitability for real-world robot systems by demonstrating successful task executions with a domestic support robot, with perception and manipulation sub-system integration. Hence, we show that by embedding physics-based causal reasoning into robots' decision-making processes, we can make robot task execution safer, more reliable, and more robust to various types of uncertainty.
comment: 8 pages, 9 figures, submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
☆ HyperGALE: ASD Classification via Hypergraph Gated Attention with Learnable Hyperedges IJCNN 2024
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by varied social cognitive challenges and repetitive behavioral patterns. Identifying reliable brain imaging-based biomarkers for ASD has been a persistent challenge due to the spectrum's diverse symptomatology. Existing baselines in the field have made significant strides in this direction, yet there remains room for improvement in both performance and interpretability. We propose \emph{HyperGALE}, which builds upon the hypergraph by incorporating learned hyperedges and gated attention mechanisms. This approach has led to substantial improvements in the model's ability to interpret complex brain graph data, offering deeper insights into ASD biomarker characterization. Evaluated on the extensive ABIDE II dataset, \emph{HyperGALE} not only improves interpretability but also demonstrates statistically significant enhancements in key performance metrics compared to both previous baselines and the foundational hypergraph model. The advancement \emph{HyperGALE} brings to ASD research highlights the potential of sophisticated graph-based techniques in neurodevelopmental studies. The source code and implementation instructions are available at GitHub:https://github.com/mehular0ra/HyperGALE.
comment: Accepted to IJCNN 2024
☆ Utilizing the LightGBM Algorithm for Operator User Credit Assessment Research
Mobile Internet user credit assessment is an important way for communication operators to establish decisions and formulate measures, and it is also a guarantee for operators to obtain expected benefits. However, credit evaluation methods have long been monopolized by financial industries such as banks and credit. As supporters and providers of platform network technology and network resources, communication operators are also builders and maintainers of communication networks. Internet data improves the user's credit evaluation strategy. This paper uses the massive data provided by communication operators to carry out research on the operator's user credit evaluation model based on the fusion LightGBM algorithm. First, for the massive data related to user evaluation provided by operators, key features are extracted by data preprocessing and feature engineering methods, and a multi-dimensional feature set with statistical significance is constructed; then, linear regression, decision tree, LightGBM, and other machine learning algorithms build multiple basic models to find the best basic model; finally, integrates Averaging, Voting, Blending, Stacking and other integrated algorithms to refine multiple fusion models, and finally establish the most suitable fusion model for operator user evaluation.
☆ Detoxifying Large Language Models via Knowledge Editing
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and equips comprehensive metrics for systematic evaluation. We conduct experiments to compare knowledge editing approaches with previous baselines, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance. Then, we propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), to diminish the toxicity of LLMs within a few tuning steps via only one instance. We further provide an in-depth analysis of the internal mechanism for various detoxify approaches, demonstrating that previous methods like SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of the toxic parameters to a certain extent, making permanent adjustments. We hope that these insights could shed light on future work of developing detoxifying approaches and the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.
comment: Ongoing work. Project website: https://zjunlp.github.io/project/SafeEdit Benchmark: https://huggingface.co/datasets/zjunlp/SafeEdit Code: https://github.com/zjunlp/EasyEdit
☆ Universal Feature Selection for Simultaneous Interpretability of Multitask Datasets
Extracting meaningful features from complex, high-dimensional datasets across scientific domains remains challenging. Current methods often struggle with scalability, limiting their applicability to large datasets, or make restrictive assumptions about feature-property relationships, hindering their ability to capture complex interactions. BoUTS's general and scalable feature selection algorithm surpasses these limitations to identify both universal features relevant to all datasets and task-specific features predictive for specific subsets. Evaluated on seven diverse chemical regression datasets, BoUTS achieves state-of-the-art feature sparsity while maintaining prediction accuracy comparable to specialized methods. Notably, BoUTS's universal features enable domain-specific knowledge transfer between datasets, and suggest deep connections in seemingly-disparate chemical datasets. We expect these results to have important repercussions in manually-guided inverse problems. Beyond its current application, BoUTS holds immense potential for elucidating data-poor systems by leveraging information from similar data-rich systems. BoUTS represents a significant leap in cross-domain feature selection, potentially leading to advancements in various scientific fields.
comment: Main text: 14 pages, 3 figures, 1 table; SI: 7 pages, 1 figure, 4 tables, 3 algorithms
☆ gTBLS: Generating Tables from Text by Conditional Question Answering
Distilling large, unstructured text into a structured, condensed form such as tables is an open research problem. One of the primary challenges in automatically generating tables is ensuring their syntactic validity. Prior approaches address this challenge by including additional parameters in the Transformer's attention mechanism to attend to specific rows and column headers. In contrast to this single-stage method, this paper presents a two-stage approach called Generative Tables (gTBLS). The first stage infers table structure (row and column headers) from the text. The second stage formulates questions using these headers and fine-tunes a causal language model to answer them. Furthermore, the gTBLS approach is amenable to the utilization of pre-trained Large Language Models in a zero-shot configuration, presenting a solution for table generation in situations where fine-tuning is not feasible. gTBLS improves prior approaches by up to 10% in BERTScore on the table construction task and up to 20% on the table content generation task of the E2E, WikiTableText, WikiBio, and RotoWire datasets.
comment: 12 pages, 1 figure
Language Models Can Reduce Asymmetry in Information Markets
This work addresses the buyer's inspection paradox for information markets. The paradox is that buyers need to access information to determine its value, while sellers need to limit access to prevent theft. To study this, we introduce an open-source simulated digital marketplace where intelligent agents, powered by language models, buy and sell information on behalf of external participants. The central mechanism enabling this marketplace is the agents' dual capabilities: they not only have the capacity to assess the quality of privileged information but also come equipped with the ability to forget. This ability to induce amnesia allows vendors to grant temporary access to proprietary information, significantly reducing the risk of unauthorized retention while enabling agents to accurately gauge the information's relevance to specific queries or tasks. To perform well, agents must make rational decisions, strategically explore the marketplace through generated sub-queries, and synthesize answers from purchased information. Concretely, our experiments (a) uncover biases in language models leading to irrational behavior and evaluate techniques to mitigate these biases, (b) investigate how price affects demand in the context of informational goods, and (c) show that inspection and higher budgets both lead to higher quality outcomes.
☆ Analysing Diffusion Segmentation for Medical Images
Denoising Diffusion Probabilistic models have become increasingly popular due to their ability to offer probabilistic modeling and generate diverse outputs. This versatility inspired their adaptation for image segmentation, where multiple predictions of the model can produce segmentation results that not only achieve high quality but also capture the uncertainty inherent in the model. Here, powerful architectures were proposed for improving diffusion segmentation performance. However, there is a notable lack of analysis and discussions on the differences between diffusion segmentation and image generation, and thorough evaluations are missing that distinguish the improvements these architectures provide for segmentation in general from their benefit for diffusion segmentation specifically. In this work, we critically analyse and discuss how diffusion segmentation for medical images differs from diffusion image generation, with a particular focus on the training behavior. Furthermore, we conduct an assessment how proposed diffusion segmentation architectures perform when trained directly for segmentation. Lastly, we explore how different medical segmentation tasks influence the diffusion segmentation behavior and the diffusion process could be adapted accordingly. With these analyses, we aim to provide in-depth insights into the behavior of diffusion segmentation that allow for a better design and evaluation of diffusion segmentation methods in the future.
☆ A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals in an LLM. Using multimodal information yields relative equal-error-rate improvements over text-only and audio-only models of up to 39% and 61%. Increasing the size of the LLM and training with low-rank adaption leads to further relative EER reductions of up to 18% on our dataset.
comment: arXiv admin note: text overlap with arXiv:2312.03632
☆ Biased Binary Attribute Classifiers Ignore the Majority Classes
To visualize the regions of interest that classifiers base their decisions on, different Class Activation Mapping (CAM) methods have been developed. However, all of these techniques target categorical classifiers only, though most real-world tasks are binary classification. In this paper, we extend gradient-based CAM techniques to work with binary classifiers and visualize the active regions for binary facial attribute classifiers. When training an unbalanced binary classifier on an imbalanced dataset, it is well-known that the majority class, i.e. the class with many training samples, is mostly predicted much better than minority class with few training instances. In our experiments on the CelebA dataset, we verify these results, when training an unbalanced classifier to extract 40 facial attributes simultaneously. One would expect that the biased classifier has learned to extract features mainly for the majority classes and that the proportional energy of the activations mainly reside in certain specific regions of the image where the attribute is located. However, we find very little regular activation for samples of majority classes, while the active regions for minority classes seem mostly reasonable and overlap with our expectations. These results suggest that biased classifiers mainly rely on bias activation for majority classes. When training a balanced classifier on the imbalanced data by employing attribute-specific class weights, majority and minority classes are classified similarly well and show expected activations for almost all attributes
☆ Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation
Deep learning-based image generation has seen significant advancements with diffusion models, notably improving the quality of generated images. Despite these developments, generating images with unseen characteristics beneficial for downstream tasks has received limited attention. To bridge this gap, we propose Style-Extracting Diffusion Models, featuring two conditioning mechanisms. Specifically, we utilize 1) a style conditioning mechanism which allows to inject style information of previously unseen images during image generation and 2) a content conditioning which can be targeted to a downstream task, e.g., layout for segmentation. We introduce a trainable style encoder to extract style information from images, and an aggregation block that merges style information from multiple style inputs. This architecture enables the generation of images with unseen styles in a zero-shot manner, by leveraging styles from unseen images, resulting in more diverse generations. In this work, we use the image layout as target condition and first show the capability of our method on a natural image dataset as a proof-of-concept. We further demonstrate its versatility in histopathology, where we combine prior knowledge about tissue composition and unannotated data to create diverse synthetic images with known layouts. This allows us to generate additional synthetic data to train a segmentation network in a semi-supervised fashion. We verify the added value of the generated images by showing improved segmentation results and lower performance variability between patients when synthetic images are included during segmentation training. Our code will be made publicly available at [LINK].
☆ Task-optimal data-driven surrogate models for eNMPC via differentiable simulation and optimization
We present a method for end-to-end learning of Koopman surrogate models for optimal performance in control. In contrast to previous contributions that employ standard reinforcement learning (RL) algorithms, we use a training algorithm that exploits the potential differentiability of environments based on mechanistic simulation models. We evaluate the performance of our method by comparing it to that of other controller type and training algorithm combinations on a literature known eNMPC case study. Our method exhibits superior performance on this problem, thereby constituting a promising avenue towards more capable controllers that employ dynamic surrogate models.
comment: 6 pages, 4 figures, 1 table
☆ DP-RDM: Adapting Diffusion Models to Private Domains Without Fine-Tuning
Text-to-image diffusion models have been shown to suffer from sample-level memorization, possibly reproducing near-perfect replica of images that they are trained on, which may be undesirable. To remedy this issue, we develop the first differentially private (DP) retrieval-augmented generation algorithm that is capable of generating high-quality image samples while providing provable privacy guarantees. Specifically, we assume access to a text-to-image diffusion model trained on a small amount of public data, and design a DP retrieval mechanism to augment the text prompt with samples retrieved from a private retrieval dataset. Our \emph{differentially private retrieval-augmented diffusion model} (DP-RDM) requires no fine-tuning on the retrieval dataset to adapt to another domain, and can use state-of-the-art generative models to generate high-quality image samples while satisfying rigorous DP guarantees. For instance, when evaluated on MS-COCO, our DP-RDM can generate samples with a privacy budget of $\epsilon=10$, while providing a $3.5$ point improvement in FID compared to public-only retrieval for up to $10,000$ queries.
☆ Model Uncertainty in Evolutionary Optimization and Bayesian Optimization: A Comparative Analysis
Black-box optimization problems, which are common in many real-world applications, require optimization through input-output interactions without access to internal workings. This often leads to significant computational resources being consumed for simulations. Bayesian Optimization (BO) and Surrogate-Assisted Evolutionary Algorithm (SAEA) are two widely used gradient-free optimization techniques employed to address such challenges. Both approaches follow a similar iterative procedure that relies on surrogate models to guide the search process. This paper aims to elucidate the similarities and differences in the utilization of model uncertainty between these two methods, as well as the impact of model inaccuracies on algorithmic performance. A novel model-assisted strategy is introduced, which utilizes unevaluated solutions to generate offspring, leveraging the population-based search capabilities of evolutionary algorithm to enhance the effectiveness of model-assisted optimization. Experimental results demonstrate that the proposed approach outperforms mainstream Bayesian optimization algorithms in terms of accuracy and efficiency.
☆ GLC++: Source-Free Universal Domain Adaptation through Global-Local Clustering and Contrastive Affinity Learning CVPR 2023
Deep neural networks often exhibit sub-optimal performance under covariate and category shifts. Source-Free Domain Adaptation (SFDA) presents a promising solution to this dilemma, yet most SFDA approaches are restricted to closed-set scenarios. In this paper, we explore Source-Free Universal Domain Adaptation (SF-UniDA) aiming to accurately classify "known" data belonging to common categories and segregate them from target-private "unknown" data. We propose a novel Global and Local Clustering (GLC) technique, which comprises an adaptive one-vs-all global clustering algorithm to discern between target classes, complemented by a local k-NN clustering strategy to mitigate negative transfer. Despite the effectiveness, the inherent closed-set source architecture leads to uniform treatment of "unknown" data, impeding the identification of distinct "unknown" categories. To address this, we evolve GLC to GLC++, integrating a contrastive affinity learning strategy. We examine the superiority of GLC and GLC++ across multiple benchmarks and category shift scenarios. Remarkably, in the most challenging open-partial-set scenarios, GLC and GLC++ surpass GATE by 16.7% and 18.6% in H-score on VisDA, respectively. GLC++ enhances the novel category clustering accuracy of GLC by 4.3% in open-set scenarios on Office-Home. Furthermore, the introduced contrastive learning strategy not only enhances GLC but also significantly facilitates existing methodologies.
comment: This is a substantial extension of the CVPR 2023 paper "Upcycling Models under Domain and Category Shift"
☆ Physics-Informed Diffusion Models
Generative models such as denoising diffusion models are quickly advancing their ability to approximate highly complex data distributions. They are also increasingly leveraged in scientific machine learning, where samples from the implied data distribution are expected to adhere to specific governing equations. We present a framework to inform denoising diffusion models on underlying constraints on such generated samples during model training. Our approach improves the alignment of the generated samples with the imposed constraints and significantly outperforms existing methods without affecting inference speed. Additionally, our findings suggest that incorporating such constraints during training provides a natural regularization against overfitting. Our framework is easy to implement and versatile in its applicability for imposing equality and inequality constraints as well as auxiliary optimization objectives.
comment: 15 pages, 4 figures
☆ Regularized Adaptive Momentum Dual Averaging with an Efficient Inexact Subproblem Solver for Training Structured Neural Network
We propose a Regularized Adaptive Momentum Dual Averaging (RAMDA) algorithm for training structured neural networks. Similar to existing regularized adaptive methods, the subproblem for computing the update direction of RAMDA involves a nonsmooth regularizer and a diagonal preconditioner, and therefore does not possess a closed-form solution in general. We thus also carefully devise an implementable inexactness condition that retains convergence guarantees similar to the exact versions, and propose a companion efficient solver for the subproblems of both RAMDA and existing methods to make them practically feasible. We leverage the theory of manifold identification in variational analysis to show that, even in the presence of such inexactness, the iterates of RAMDA attain the ideal structure induced by the regularizer at the stationary point of asymptotic convergence. This structure is locally optimal near the point of convergence, so RAMDA is guaranteed to obtain the best structure possible among all methods converging to the same point, making it the first regularized adaptive method outputting models that possess outstanding predictive performance while being (locally) optimally structured. Extensive numerical experiments in large-scale modern computer vision, language modeling, and speech tasks show that the proposed RAMDA is efficient and consistently outperforms state of the art for training structured neural network. Implementation of our algorithm is available at http://www.github.com/ismoptgroup/RAMDA/.
☆ A Bag of Tricks for Few-Shot Class-Incremental Learning
We present a bag of tricks framework for few-shot class-incremental learning (FSCIL), which is a challenging form of continual learning that involves continuous adaptation to new tasks with limited samples. FSCIL requires both stability and adaptability, i.e., preserving proficiency in previously learned tasks while learning new ones. Our proposed bag of tricks brings together eight key and highly influential techniques that improve stability, adaptability, and overall performance under a unified framework for FSCIL. We organize these tricks into three categories: stability tricks, adaptability tricks, and training tricks. Stability tricks aim to mitigate the forgetting of previously learned classes by enhancing the separation between the embeddings of learned classes and minimizing interference when learning new ones. On the other hand, adaptability tricks focus on the effective learning of new classes. Finally, training tricks improve the overall performance without compromising stability or adaptability. We perform extensive experiments on three benchmark datasets, CIFAR-100, CUB-200, and miniIMageNet, to evaluate the impact of our proposed framework. Our detailed analysis shows that our approach substantially improves both stability and adaptability, establishing a new state-of-the-art by outperforming prior works in the area. We believe our method provides a go-to solution and establishes a robust baseline for future research in this area.
☆ Estimating Causal Effects with Double Machine Learning -- A Method Evaluation
The estimation of causal effects with observational data continues to be a very active research area. In recent years, researchers have developed new frameworks which use machine learning to relax classical assumptions necessary for the estimation of causal effects. In this paper, we review one of the most prominent methods - "double/debiased machine learning" (DML) - and empirically evaluate it by comparing its performance on simulated data relative to more traditional statistical methods, before applying it to real-world data. Our findings indicate that the application of a suitably flexible machine learning algorithm within DML improves the adjustment for various nonlinear confounding relationships. This advantage enables a departure from traditional functional form assumptions typically necessary in causal effect estimation. However, we demonstrate that the method continues to critically depend on standard assumptions about causal structure and identification. When estimating the effects of air pollution on housing prices in our application, we find that DML estimates are consistently larger than estimates of less flexible methods. From our overall results, we provide actionable recommendations for specific choices researchers must make when applying DML in practice.
☆ Tensor network compressibility of convolutional models
Convolutional neural networks (CNNs) represent one of the most widely used neural network architectures, showcasing state-of-the-art performance in computer vision tasks. Although larger CNNs generally exhibit higher accuracy, their size can be effectively reduced by "tensorization" while maintaining accuracy. Tensorization consists of replacing the convolution kernels with compact decompositions such as Tucker, Canonical Polyadic decompositions, or quantum-inspired decompositions such as matrix product states, and directly training the factors in the decompositions to bias the learning towards low-rank decompositions. But why doesn't tensorization seem to impact the accuracy adversely? We explore this by assessing how truncating the convolution kernels of dense (untensorized) CNNs impact their accuracy. Specifically, we truncated the kernels of (i) a vanilla four-layer CNN and (ii) ResNet-50 pre-trained for image classification on CIFAR-10 and CIFAR-100 datasets. We found that kernels (especially those inside deeper layers) could often be truncated along several cuts resulting in significant loss in kernel norm but not in classification accuracy. This suggests that such ``correlation compression'' (underlying tensorization) is an intrinsic feature of how information is encoded in dense CNNs. We also found that aggressively truncated models could often recover the pre-truncation accuracy after only a few epochs of re-training, suggesting that compressing the internal correlations of convolution layers does not often transport the model to a worse minimum. Our results can be applied to tensorize and compress CNN models more effectively.
comment: 20 pages, 21 images
☆ Knowledge-Enhanced Recommendation with User-Centric Subgraph Network
Recommendation systems, as widely implemented nowadays on various platforms, recommend relevant items to users based on their preferences. The classical methods which rely on user-item interaction matrices has limitations, especially in scenarios where there is a lack of interaction data for new items. Knowledge graph (KG)-based recommendation systems have emerged as a promising solution. However, most KG-based methods adopt node embeddings, which do not provide personalized recommendations for different users and cannot generalize well to the new items. To address these limitations, we propose Knowledge-enhanced User-Centric subgraph Network (KUCNet), a subgraph learning approach with graph neural network (GNN) for effective recommendation. KUCNet constructs a U-I subgraph for each user-item pair that captures both the historical information of user-item interactions and the side information provided in KG. An attention-based GNN is designed to encode the U-I subgraphs for recommendation. Considering efficiency, the pruned user-centric computation graph is further introduced such that multiple U-I subgraphs can be simultaneously computed and that the size can be pruned by Personalized PageRank. Our proposed method achieves accurate, efficient, and interpretable recommendations especially for new items. Experimental results demonstrate the superiority of KUCNet over state-of-the-art KG-based and collaborative filtering (CF)-based methods.
☆ Loop Improvement: An Efficient Approach for Extracting Shared Features from Heterogeneous Data without Central Server
In federated learning, data heterogeneity significantly impacts performance. A typical solution involves segregating these parameters into shared and personalized components, a concept also relevant in multi-task learning. Addressing this, we propose "Loop Improvement" (LI), a novel method enhancing this separation and feature extraction without necessitating a central server or data interchange among participants. Our experiments reveal LI's superiority in several aspects: In personalized federated learning environments, LI consistently outperforms the advanced FedALA algorithm in accuracy across diverse scenarios. Additionally, LI's feature extractor closely matches the performance achieved when aggregating data from all clients. In global model contexts, employing LI with stacked personalized layers and an additional network also yields comparable results to combined client data scenarios. Furthermore, LI's adaptability extends to multi-task learning, streamlining the extraction of common features across tasks and obviating the need for simultaneous training. This approach not only enhances individual task performance but also achieves accuracy levels on par with classic multi-task learning methods where all tasks are trained simultaneously. LI integrates a loop topology with layer-wise and end-to-end training, compatible with various neural network models. This paper also delves into the theoretical underpinnings of LI's effectiveness, offering insights into its potential applications. The code is on https://github.com/axedge1983/LI
comment: 11 pages, 11 figures
☆ Varroa destructor detection on honey bees using hyperspectral imagery
Hyperspectral (HS) imagery in agriculture is becoming increasingly common. These images have the advantage of higher spectral resolution. Advanced spectral processing techniques are required to unlock the information potential in these HS images. The present paper introduces a method rooted in multivariate statistics designed to detect parasitic Varroa destructor mites on the body of western honey bee Apis mellifera, enabling easier and continuous monitoring of the bee hives. The methodology explores unsupervised (K-means++) and recently developed supervised (Kernel Flows - Partial Least-Squares, KF-PLS) methods for parasitic identification. Additionally, in light of the emergence of custom-band multispectral cameras, the present research outlines a strategy for identifying the specific wavelengths necessary for effective bee-mite separation, suitable for implementation in a custom-band camera. Illustrated with a real-case dataset, our findings demonstrate that as few as four spectral bands are sufficient for accurate parasite identification.
☆ Exploring the Potential of Large Language Models in Graph Generation
Large language models (LLMs) have achieved great success in many fields, and recent works have studied exploring LLMs for graph discriminative tasks such as node classification. However, the abilities of LLMs for graph generation remain unexplored in the literature. Graph generation requires the LLM to generate graphs with given properties, which has valuable real-world applications such as drug discovery, while tends to be more challenging. In this paper, we propose LLM4GraphGen to explore the ability of LLMs for graph generation with systematical task designs and extensive experiments. Specifically, we propose several tasks tailored with comprehensive experiments to address key questions regarding LLMs' understanding of different graph structure rules, their ability to capture structural type distributions, and their utilization of domain knowledge for property-based graph generation. Our evaluations demonstrate that LLMs, particularly GPT-4, exhibit preliminary abilities in graph generation tasks, including rule-based and distribution-based generation. We also observe that popular prompting methods, such as few-shot and chain-of-thought prompting, do not consistently enhance performance. Besides, LLMs show potential in generating molecules with specific properties. These findings may serve as foundations for designing good LLMs based models for graph generation and provide valuable insights and further research.
☆ DomainLab: A modular Python package for domain generalization in deep learning
Poor generalization performance caused by distribution shifts in unseen domains often hinders the trustworthy deployment of deep neural networks. Many domain generalization techniques address this problem by adding a domain invariant regularization loss terms during training. However, there is a lack of modular software that allows users to combine the advantages of different methods with minimal effort for reproducibility. DomainLab is a modular Python package for training user specified neural networks with composable regularization loss terms. Its decoupled design allows the separation of neural networks from regularization loss construction. Hierarchical combinations of neural networks, different domain generalization methods, and associated hyperparameters, can all be specified together with other experimental setup in a single configuration file. Hierarchical combinations of neural networks, different domain generalization methods, and associated hyperparameters, can all be specified together with other experimental setup in a single configuration file. In addition, DomainLab offers powerful benchmarking functionality to evaluate the generalization performance of neural networks in out-of-distribution data. The package supports running the specified benchmark on an HPC cluster or on a standalone machine. The package is well tested with over 95 percent coverage and well documented. From the user perspective, it is closed to modification but open to extension. The package is under the MIT license, and its source code, tutorial and documentation can be found at https://github.com/marrlab/DomainLab.
☆ DaCapo: Accelerating Continuous Learning in Autonomous Systems for Video Analytics
Deep neural network (DNN) video analytics is crucial for autonomous systems such as self-driving vehicles, unmanned aerial vehicles (UAVs), and security robots. However, real-world deployment faces challenges due to their limited computational resources and battery power. To tackle these challenges, continuous learning exploits a lightweight "student" model at deployment (inference), leverages a larger "teacher" model for labeling sampled data (labeling), and continuously retrains the student model to adapt to changing scenarios (retraining). This paper highlights the limitations in state-of-the-art continuous learning systems: (1) they focus on computations for retraining, while overlooking the compute needs for inference and labeling, (2) they rely on power-hungry GPUs, unsuitable for battery-operated autonomous systems, and (3) they are located on a remote centralized server, intended for multi-tenant scenarios, again unsuitable for autonomous systems due to privacy, network availability, and latency concerns. We propose a hardware-algorithm co-designed solution for continuous learning, DaCapo, that enables autonomous systems to perform concurrent executions of inference, labeling, and training in a performant and energy-efficient manner. DaCapo comprises (1) a spatially-partitionable and precision-flexible accelerator enabling parallel execution of kernels on sub-accelerators at their respective precisions, and (2) a spatiotemporal resource allocation algorithm that strategically navigates the resource-accuracy tradeoff space, facilitating optimal decisions for resource allocation to achieve maximal accuracy. Our evaluation shows that DaCapo achieves 6.5% and 5.5% higher accuracy than a state-of-the-art GPU-based continuous learning systems, Ekya and EOMU, respectively, while consuming 254x less power.
☆ Exploring Task Unification in Graph Representation Learning via Generative Approach
Graphs are ubiquitous in real-world scenarios and encompass a diverse range of tasks, from node-, edge-, and graph-level tasks to transfer learning. However, designing specific tasks for each type of graph data is often costly and lacks generalizability. Recent endeavors under the "Pre-training + Fine-tuning" or "Pre-training + Prompt" paradigms aim to design a unified framework capable of generalizing across multiple graph tasks. Among these, graph autoencoders (GAEs), generative self-supervised models, have demonstrated their potential in effectively addressing various graph tasks. Nevertheless, these methods typically employ multi-stage training and require adaptive designs, which on one hand make it difficult to be seamlessly applied to diverse graph tasks and on the other hand overlook the negative impact caused by discrepancies in task objectives between the different stages. To address these challenges, we propose GA^2E, a unified adversarially masked autoencoder capable of addressing the above challenges seamlessly. Specifically, GA^2E proposes to use the subgraph as the meta-structure, which remains consistent across all graph tasks (ranging from node-, edge-, and graph-level to transfer learning) and all stages (both during training and inference). Further, GA^2E operates in a \textbf{"Generate then Discriminate"} manner. It leverages the masked GAE to reconstruct the input subgraph whilst treating it as a generator to compel the reconstructed graphs resemble the input subgraph. Furthermore, GA^2E introduces an auxiliary discriminator to discern the authenticity between the reconstructed (generated) subgraph and the input subgraph, thus ensuring the robustness of the graph representation through adversarial training mechanisms. We validate GA^2E's capabilities through extensive experiments on 21 datasets across four types of graph tasks.
☆ $\nabla τ$: Gradient-based and Task-Agnostic machine Unlearning
Machine Unlearning, the process of selectively eliminating the influence of certain data examples used during a model's training, has gained significant attention as a means for practitioners to comply with recent data protection regulations. However, existing unlearning methods face critical drawbacks, including their prohibitively high cost, often associated with a large number of hyperparameters, and the limitation of forgetting only relatively small data portions. This often makes retraining the model from scratch a quicker and more effective solution. In this study, we introduce Gradient-based and Task-Agnostic machine Unlearning ($\nabla \tau$), an optimization framework designed to remove the influence of a subset of training data efficiently. It applies adaptive gradient ascent to the data to be forgotten while using standard gradient descent for the remaining data. $\nabla \tau$ offers multiple benefits over existing approaches. It enables the unlearning of large sections of the training dataset (up to 30%). It is versatile, supporting various unlearning tasks (such as subset forgetting or class removal) and applicable across different domains (images, text, etc.). Importantly, $\nabla \tau$ requires no hyperparameter adjustments, making it a more appealing option than retraining the model from scratch. We evaluate our framework's effectiveness using a set of well-established Membership Inference Attack metrics, demonstrating up to 10% enhancements in performance compared to state-of-the-art methods without compromising the original model's accuracy.
comment: 14 pages, 2 figures
☆ A Differentially Private Clustering Algorithm for Well-Clustered Graphs
We study differentially private (DP) algorithms for recovering clusters in well-clustered graphs, which are graphs whose vertex set can be partitioned into a small number of sets, each inducing a subgraph of high inner conductance and small outer conductance. Such graphs have widespread application as a benchmark in the theoretical analysis of spectral clustering. We provide an efficient ($\epsilon$,$\delta$)-DP algorithm tailored specifically for such graphs. Our algorithm draws inspiration from the recent work of Chen et al., who developed DP algorithms for recovery of stochastic block models in cases where the graph comprises exactly two nearly-balanced clusters. Our algorithm works for well-clustered graphs with $k$ nearly-balanced clusters, and the misclassification ratio almost matches the one of the best-known non-private algorithms. We conduct experimental evaluations on datasets with known ground truth clusters to substantiate the prowess of our algorithm. We also show that any (pure) $\epsilon$-DP algorithm would result in substantial error.
☆ Distilling Reinforcement Learning Policies for Interpretable Robot Locomotion: Gradient Boosting Machines and Symbolic Regression
Recent advancements in reinforcement learning (RL) have led to remarkable achievements in robot locomotion capabilities. However, the complexity and ``black-box'' nature of neural network-based RL policies hinder their interpretability and broader acceptance, particularly in applications demanding high levels of safety and reliability. This paper introduces a novel approach to distill neural RL policies into more interpretable forms using Gradient Boosting Machines (GBMs), Explainable Boosting Machines (EBMs) and Symbolic Regression. By leveraging the inherent interpretability of generalized additive models, decision trees, and analytical expressions, we transform opaque neural network policies into more transparent ``glass-box'' models. We train expert neural network policies using RL and subsequently distill them into (i) GBMs, (ii) EBMs, and (iii) symbolic policies. To address the inherent distribution shift challenge of behavioral cloning, we propose to use the Dataset Aggregation (DAgger) algorithm with a curriculum of episode-dependent alternation of actions between expert and distilled policies, to enable efficient distillation of feedback control policies. We evaluate our approach on various robot locomotion gaits -- walking, trotting, bounding, and pacing -- and study the importance of different observations in joint actions for distilled policies using various methods. We train neural expert policies for 205 hours of simulated experience and distill interpretable policies with only 10 minutes of simulated interaction for each gait using the proposed method.
☆ Investigating the validity of structure learning algorithms in identifying risk factors for intervention in patients with diabetes
Diabetes, a pervasive and enduring health challenge, imposes significant global implications on health, financial healthcare systems, and societal well-being. This study undertakes a comprehensive exploration of various structural learning algorithms to discern causal pathways amongst potential risk factors influencing diabetes progression. The methodology involves the application of these algorithms to relevant diabetes data, followed by the conversion of their output graphs into Causal Bayesian Networks (CBNs), enabling predictive analysis and the evaluation of discrepancies in the effect of hypothetical interventions within our context-specific case study. This study highlights the substantial impact of algorithm selection on intervention outcomes. To consolidate insights from diverse algorithms, we employ a model-averaging technique that helps us obtain a unique causal model for diabetes derived from a varied set of structural learning algorithms. We also investigate how each of those individual graphs, as well as the average graph, compare to the structures elicited by a domain expert who categorised graph edges into high confidence, moderate, and low confidence types, leading into three individual graphs corresponding to the three levels of confidence. The resulting causal model and data are made available online, and serve as a valuable resource and a guide for informed decision-making by healthcare practitioners, offering a comprehensive understanding of the interactions between relevant risk factors and the effect of hypothetical interventions. Therefore, this research not only contributes to the academic discussion on diabetes, but also provides practical guidance for healthcare professionals in developing efficient intervention and risk management strategies.
comment: 20 pages, 17 figures, 4 tables
☆ Neural Network-Based Processing and Reconstruction of Compromised Biophotonic Image Data
The integration of deep learning techniques with biophotonic setups has opened new horizons in bioimaging. A compelling trend in this field involves deliberately compromising certain measurement metrics to engineer better bioimaging tools in terms of cost, speed, and form-factor, followed by compensating for the resulting defects through the utilization of deep learning models trained on a large amount of ideal, superior or alternative data. This strategic approach has found increasing popularity due to its potential to enhance various aspects of biophotonic imaging. One of the primary motivations for employing this strategy is the pursuit of higher temporal resolution or increased imaging speed, critical for capturing fine dynamic biological processes. This approach also offers the prospect of simplifying hardware requirements/complexities, thereby making advanced imaging standards more accessible in terms of cost and/or size. This article provides an in-depth review of the diverse measurement aspects that researchers intentionally impair in their biophotonic setups, including the point spread function, signal-to-noise ratio, sampling density, and pixel resolution. By deliberately compromising these metrics, researchers aim to not only recuperate them through the application of deep learning networks, but also bolster in return other crucial parameters, such as the field-of-view, depth-of-field, and space-bandwidth product. Here, we discuss various biophotonic methods that have successfully employed this strategic approach. These techniques span broad applications and showcase the versatility and effectiveness of deep learning in the context of compromised biophotonic data. Finally, by offering our perspectives on the future possibilities of this rapidly evolving concept, we hope to motivate our readers to explore novel ways of balancing hardware compromises with compensation via AI.
comment: 17 Pages, 4 Figures, 1 Table
☆ SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks CVPR
The remarkable success of Vision Transformers in Artificial Neural Networks (ANNs) has led to a growing interest in incorporating the self-attention mechanism and transformer-based architecture into Spiking Neural Networks (SNNs). While existing methods propose spiking self-attention mechanisms that are compatible with SNNs, they lack reasonable scaling methods, and the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting local features. To address these challenges, we propose a novel spiking self-attention mechanism named Dual Spike Self-Attention (DSSA) with a reasonable scaling method. Based on DSSA, we propose a novel spiking Vision Transformer architecture called SpikingResformer, which combines the ResNet-based multi-stage architecture with our proposed DSSA to improve both performance and energy efficiency while reducing parameters. Experimental results show that SpikingResformer achieves higher accuracy with fewer parameters and lower energy consumption than other spiking Vision Transformer counterparts. Notably, our SpikingResformer-L achieves 79.40% top-1 accuracy on ImageNet with 4 time-steps, which is the state-of-the-art result in the SNN field.
comment: To be published in the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
☆ Impact Assessment of Missing Data in Model Predictions for Earth Observation Applications
Earth observation (EO) applications involving complex and heterogeneous data sources are commonly approached with machine learning models. However, there is a common assumption that data sources will be persistently available. Different situations could affect the availability of EO sources, like noise, clouds, or satellite mission failures. In this work, we assess the impact of missing temporal and static EO sources in trained models across four datasets with classification and regression tasks. We compare the predictive quality of different methods and find that some are naturally more robust to missing data. The Ensemble strategy, in particular, achieves a prediction robustness up to 100%. We evidence that missing scenarios are significantly more challenging in regression than classification tasks. Finally, we find that the optical view is the most critical view when it is missing individually.
comment: Accepted at IEEE International Geoscience and Remote Sensing Symposium 2024
☆ Exploring Green AI for Audio Deepfake Detection
The state-of-the-art audio deepfake detectors leveraging deep neural networks exhibit impressive recognition performance. Nonetheless, this advantage is accompanied by a significant carbon footprint. This is mainly due to the use of high-performance computing with accelerators and high training time. Studies show that average deep NLP model produces around 626k lbs of CO\textsubscript{2} which is equivalent to five times of average US car emission at its lifetime. This is certainly a massive threat to the environment. To tackle this challenge, this study presents a novel framework for audio deepfake detection that can be seamlessly trained using standard CPU resources. Our proposed framework utilizes off-the-shelve self-supervised learning (SSL) based models which are pre-trained and available in public repositories. In contrast to existing methods that fine-tune SSL models and employ additional deep neural networks for downstream tasks, we exploit classical machine learning algorithms such as logistic regression and shallow neural networks using the SSL embeddings extracted using the pre-trained model. Our approach shows competitive results compared to the commonly used high-carbon footprint approaches. In experiments with the ASVspoof 2019 LA dataset, we achieve a 0.90\% equal error rate (EER) with less than 1k trainable model parameters. To encourage further research in this direction and support reproducible results, the Python code will be made publicly accessible following acceptance. Github: https://github.com/sahasubhajit/Speech-Spoofing-
comment: This manuscript is under review in a conference
☆ Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization
Clustering speaker embeddings is crucial in speaker diarization but hasn't received as much focus as other components. Moreover, the robustness of speaker diarization across various datasets hasn't been explored when the development and evaluation data are from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization. Our extensive experiments on two widely used corpora, AMI and DIHARD, reveal the performance trend of speaker diarization in the presence of domain mismatch. We observe that the performance difference between two different domain conditions can be attributed to the role of spectral clustering. In particular, keeping other modules unchanged, we show that differences in optimal tuning parameters as well as speaker count estimation originates due to the mismatch. This study opens several future directions for speaker diarization research.
comment: Manuscript Under Review
☆ How to be fair? A study of label and selection bias
It is widely accepted that biased data leads to biased and thus potentially unfair models. Therefore, several measures for bias in data and model predictions have been proposed, as well as bias mitigation techniques whose aim is to learn models that are fair by design. Despite the myriad of mitigation techniques developed in the past decade, however, it is still poorly understood under what circumstances which methods work. Recently, Wick et al. showed, with experiments on synthetic data, that there exist situations in which bias mitigation techniques lead to more accurate models when measured on unbiased data. Nevertheless, in the absence of a thorough mathematical analysis, it remains unclear which techniques are effective under what circumstances. We propose to address this problem by establishing relationships between the type of bias and the effectiveness of a mitigation technique, where we categorize the mitigation techniques by the bias measure they optimize. In this paper we illustrate this principle for label and selection bias on the one hand, and demographic parity and ``We're All Equal'' on the other hand. Our theoretical analysis allows to explain the results of Wick et al. and we also show that there are situations where minimizing fairness measures does not result in the fairest possible distribution.
☆ Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
☆ Diffusion Models with Ensembled Structure-Based Anomaly Scoring for Unsupervised Anomaly Detection
Supervised deep learning techniques show promise in medical image analysis. However, they require comprehensive annotated data sets, which poses challenges, particularly for rare diseases. Consequently, unsupervised anomaly detection (UAD) emerges as a viable alternative for pathology segmentation, as only healthy data is required for training. However, recent UAD anomaly scoring functions often focus on intensity only and neglect structural differences, which impedes the segmentation performance. This work investigates the potential of Structural Similarity (SSIM) to bridge this gap. SSIM captures both intensity and structural disparities and can be advantageous over the classical $l1$ error. However, we show that there is more than one optimal kernel size for the SSIM calculation for different pathologies. Therefore, we investigate an adaptive ensembling strategy for various kernel sizes to offer a more pathology-agnostic scoring mechanism. We demonstrate that this ensembling strategy can enhance the performance of DMs and mitigate the sensitivity to different kernel sizes across varying pathologies, highlighting its promise for brain MRI anomaly detection.
comment: Accepted at IEEE ISBI 2024
☆ ERD: A Framework for Improving LLM Reasoning for Cognitive Distortion Classification
Improving the accessibility of psychotherapy with the aid of Large Language Models (LLMs) is garnering a significant attention in recent years. Recognizing cognitive distortions from the interviewee's utterances can be an essential part of psychotherapy, especially for cognitive behavioral therapy. In this paper, we propose ERD, which improves LLM-based cognitive distortion classification performance with the aid of additional modules of (1) extracting the parts related to cognitive distortion, and (2) debating the reasoning steps by multiple agents. Our experimental results on a public dataset show that ERD improves the multi-class F1 score as well as binary specificity score. Regarding the latter score, it turns out that our method is effective in debiasing the baseline method which has high false positive rate, especially when the summary of multi-agent debate is provided to LLMs.
☆ LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding LREC
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained significant attention due to their importance. Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure. However, these methods require fine-tuning for each task and dataset, and the models are expensive to train and operate. To overcome this limitation, we propose a new LayoutLLM that integrates these with large-scale language models (LLMs). By leveraging the strengths of existing research in document image understanding and LLMs' superior language understanding capabilities, the proposed model, fine-tuned with multimodal instruction datasets, performs an understanding of document images in a single model. Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
comment: LREC-COLING 2024
☆ Isotropic Gaussian Splatting for Real-Time Radiance Field Rendering
The 3D Gaussian splatting method has drawn a lot of attention, thanks to its high performance in training and high quality of the rendered image. However, it uses anisotropic Gaussian kernels to represent the scene. Although such anisotropic kernels have advantages in representing the geometry, they lead to difficulties in terms of computation, such as splitting or merging two kernels. In this paper, we propose to use isotropic Gaussian kernels to avoid such difficulties in the computation, leading to a higher performance method. The experiments confirm that the proposed method is about {\bf 100X} faster without losing the geometry representation accuracy. The proposed method can be applied in a large range applications where the radiance field is needed, such as 3D reconstruction, view synthesis, and dynamic object modeling.
☆ A Unified Framework for Model Editing
Model editing is a growing area focused on updating the knowledge embedded within models. Among the various methodologies, ROME and MEMIT stand out as leading "locate-and-edit" model editing techniques. While MEMIT enables batched editing of memories, ROME is limited to changing one fact at a time. This paper introduces a unifying framework that brings ROME and MEMIT under a single conceptual umbrella, optimizing for the same goal, which we call the "preservation-memorization" objective. This objective aims to preserve the representations of certain selected vectors while memorizing the representations of new factual information. Specifically, ROME optimizes this objective using an equality constraint, whereas MEMIT employs a more flexible least-square constraint. In addition to making batched edits, MEMIT also edits the model at multiple layers. We disentangle the distribution of edits to multiple layers from the optimization objective of MEMIT and show that these edit-distribution algorithms should be considered separate entities worthy of their own line of research. Finally, we present EMMET - an Equality-constrained Mass Model Editing algorithm for Transformers, a new batched memory-editing algorithm. With EMMET, we present a closed form solution for the equality-constrained version of the preservation-memorization objective. We show that EMMET is able to perform batched-edits on par with MEMIT up to a batch-size of 256 and discuss the challenges in stabilizing EMMET. By articulating the "locate-and-edit" model editing algorithms under a simple conceptual framework of "preservation-memorization", we aim to bridge the gap between intuition and mathematics and hope to simplify the journey for future researchers in model editing.
☆ RG-CAT: Detection Pipeline and Catalogue of Radio Galaxies in the EMU Pilot Survey
We present source detection and catalogue construction pipelines to build the first catalogue of radio galaxies from the 270 $\rm deg^2$ pilot survey of the Evolutionary Map of the Universe (EMU-PS) conducted with the Australian Square Kilometre Array Pathfinder (ASKAP) telescope. The detection pipeline uses Gal-DINO computer-vision networks (Gupta et al., 2024) to predict the categories of radio morphology and bounding boxes for radio sources, as well as their potential infrared host positions. The Gal-DINO network is trained and evaluated on approximately 5,000 visually inspected radio galaxies and their infrared hosts, encompassing both compact and extended radio morphologies. We find that the Intersection over Union (IoU) for the predicted and ground truth bounding boxes is larger than 0.5 for 99% of the radio sources, and 98% of predicted host positions are within $3^{\prime \prime}$ of the ground truth infrared host in the evaluation set. The catalogue construction pipeline uses the predictions of the trained network on the radio and infrared image cutouts based on the catalogue of radio components identified using the Selavy source finder algorithm. Confidence scores of the predictions are then used to prioritize Selavy components with higher scores and incorporate them first into the catalogue. This results in identifications for a total of 211,625 radio sources, with 201,211 classified as compact and unresolved. The remaining 10,414 are categorized as extended radio morphologies, including 582 FR-I, 5,602 FR-II, 1,494 FR-x (uncertain whether FR-I or FR-II), 2,375 R (single-peak resolved) radio galaxies, and 361 with peculiar and other rare morphologies. We cross-match the radio sources in the catalogue with the infrared and optical catalogues, finding infrared cross-matches for 73% and photometric redshifts for 36% of the radio galaxies.
comment: Accepted for publication in PASA. The paper has 22 pages, 12 figures and 5 tables
☆ SoftPatch: Unsupervised Anomaly Detection with Noisy Data
Although mainstream unsupervised anomaly detection (AD) algorithms perform well in academic datasets, their performance is limited in practical application due to the ideal experimental setting of clean training data. Training with noisy data is an inevitable problem in real-world anomaly detection but is seldom discussed. This paper considers label-level noise in image sensory anomaly detection for the first time. To solve this problem, we proposed a memory-based unsupervised AD method, SoftPatch, which efficiently denoises the data at the patch level. Noise discriminators are utilized to generate outlier scores for patch-level noise elimination before coreset construction. The scores are then stored in the memory bank to soften the anomaly detection boundary. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in coreset. Comprehensive experiments in various noise scenes demonstrate that SoftPatch outperforms the state-of-the-art AD methods on the MVTecAD and BTAD benchmarks and is comparable to those methods under the setting without noise.
comment: 36th Conference on Neural Information Processing Systems
☆ Contrastive Balancing Representation Learning for Heterogeneous Dose-Response Curves Estimation
Estimating the individuals' potential response to varying treatment doses is crucial for decision-making in areas such as precision medicine and management science. Most recent studies predict counterfactual outcomes by learning a covariate representation that is independent of the treatment variable. However, such independence constraints neglect much of the covariate information that is useful for counterfactual prediction, especially when the treatment variables are continuous. To tackle the above issue, in this paper, we first theoretically demonstrate the importance of the balancing and prognostic representations for unbiased estimation of the heterogeneous dose-response curves, that is, the learned representations are constrained to satisfy the conditional independence between the covariates and both of the treatment variables and the potential responses. Based on this, we propose a novel Contrastive balancing Representation learning Network using a partial distance measure, called CRNet, for estimating the heterogeneous dose-response curves without losing the continuity of treatments. Extensive experiments are conducted on synthetic and real-world datasets demonstrating that our proposal significantly outperforms previous methods.
☆ Recovering Latent Confounders from High-dimensional Proxy Variables
Detecting latent confounders from proxy variables is an essential problem in causal effect estimation. Previous approaches are limited to low-dimensional proxies, sorted proxies, and binary treatments. We remove these assumptions and present a novel Proxy Confounder Factorization (PCF) framework for continuous treatment effect estimation when latent confounders manifest through high-dimensional, mixed proxy variables. For specific sample sizes, our two-step PCF implementation, using Independent Component Analysis (ICA-PCF), and the end-to-end implementation, using Gradient Descent (GD-PCF), achieve high correlation with the latent confounder and low absolute error in causal effect estimation with synthetic datasets in the high sample size regime. Even when faced with climate data, ICA-PCF recovers four components that explain $75.9\%$ of the variance in the North Atlantic Oscillation, a known confounder of precipitation patterns in Europe. Code for our PCF implementations and experiments can be found here: https://github.com/IPL-UV/confound_it. The proposed methodology constitutes a stepping stone towards discovering latent confounders and can be applied to many problems in disciplines dealing with high-dimensional observed proxies, e.g., spatiotemporal fields.
☆ Posterior concentrations of fully-connected Bayesian neural networks with general priors on the weights
Bayesian approaches for training deep neural networks (BNNs) have received significant interest and have been effectively utilized in a wide range of applications. There have been several studies on the properties of posterior concentrations of BNNs. However, most of these studies only demonstrate results in BNN models with sparse or heavy-tailed priors. Surprisingly, no theoretical results currently exist for BNNs using Gaussian priors, which are the most commonly used one. The lack of theory arises from the absence of approximation results of Deep Neural Networks (DNNs) that are non-sparse and have bounded parameters. In this paper, we present a new approximation theory for non-sparse DNNs with bounded parameters. Additionally, based on the approximation theory, we show that BNNs with non-sparse general priors can achieve near-minimax optimal posterior concentration rates to the true model.
☆ Debiasing surgeon: fantastic weights and how to find them
Nowadays an ever-growing concerning phenomenon, the emergence of algorithmic biases that can lead to unfair models, emerges. Several debiasing approaches have been proposed in the realm of deep learning, employing more or less sophisticated approaches to discourage these models from massively employing these biases. However, a question emerges: is this extra complexity really necessary? Is a vanilla-trained model already embodying some ``unbiased sub-networks'' that can be used in isolation and propose a solution without relying on the algorithmic biases? In this work, we show that such a sub-network typically exists, and can be extracted from a vanilla-trained model without requiring additional training. We further validate that such specific architecture is incapable of learning a specific bias, suggesting that there are possible architectural countermeasures to the problem of biases in deep neural networks.
☆ OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation
The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring muiltimodal knowledge to pixel-level classification. However, leveraging pre-trained CLIP knowledge to closely align text embeddings with pixel embeddings still has limitations in existing approaches. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce the extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
comment: 22 pages, 7 figures
☆ Policy Mirror Descent with Lookahead
Policy Mirror Descent (PMD) stands as a versatile algorithmic framework encompassing several seminal policy gradient algorithms such as natural policy gradient, with connections with state-of-the-art reinforcement learning (RL) algorithms such as TRPO and PPO. PMD can be seen as a soft Policy Iteration algorithm implementing regularized 1-step greedy policy improvement. However, 1-step greedy policies might not be the best choice and recent remarkable empirical successes in RL such as AlphaGo and AlphaZero have demonstrated that greedy approaches with respect to multiple steps outperform their 1-step counterpart. In this work, we propose a new class of PMD algorithms called $h$-PMD which incorporates multi-step greedy policy improvement with lookahead depth $h$ to the PMD update rule. To solve discounted infinite horizon Markov Decision Processes with discount factor $\gamma$, we show that $h$-PMD which generalizes the standard PMD enjoys a faster dimension-free $\gamma^h$-linear convergence rate, contingent on the computation of multi-step greedy policies. We propose an inexact version of $h$-PMD where lookahead action values are estimated. Under a generative model, we establish a sample complexity for $h$-PMD which improves over prior work. Finally, we extend our result to linear function approximation to scale to large state spaces. Under suitable assumptions, our sample complexity only involves dependence on the dimension of the feature map space instead of the state space size.
☆ Deep Learning for Trajectory Data Management and Mining: A Survey and Beyond
Trajectory computing is a pivotal domain encompassing trajectory data management and mining, garnering widespread attention due to its crucial role in various practical applications such as location services, urban traffic, and public safety. Traditional methods, focusing on simplistic spatio-temporal features, face challenges of complex calculations, limited scalability, and inadequate adaptability to real-world complexities. In this paper, we present a comprehensive review of the development and recent advances in deep learning for trajectory computing (DL4Traj). We first define trajectory data and provide a brief overview of widely-used deep learning models. Systematically, we explore deep learning applications in trajectory management (pre-processing, storage, analysis, and visualization) and mining (trajectory-related forecasting, trajectory-related recommendation, trajectory classification, travel time estimation, anomaly detection, and mobility generation). Notably, we encapsulate recent advancements in Large Language Models (LLMs) that hold the potential to augment trajectory computing. Additionally, we summarize application scenarios, public datasets, and toolkits. Finally, we outline current challenges in DL4Traj research and propose future directions. Relevant papers and open-source resources have been collated and are continuously updated at: \href{https://github.com/yoshall/Awesome-Trajectory-Computing}{DL4Traj Repo}.
comment: 25 pages, 12 figures, 5 tables
☆ Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition ICLR 2024
Video diffusion models have recently made great progress in generation quality, but are still limited by the high memory and computational requirements. This is because current video diffusion models often attempt to process high-dimensional videos directly. To tackle this issue, we propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation. Specifically, we propose an autoencoder that succinctly encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. The former represents the common content, and the latter represents the underlying motion in the video, respectively. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model. A key innovation here is the design of a compact latent space that can directly utilizes a pretrained image diffusion model, which has not been done in previous latent video diffusion models. This leads to considerably better quality generation and reduced computational costs. For instance, CMD can sample a video 7.7$\times$ faster than prior approaches by generating a video of 512$\times$1024 resolution and length 16 in 3.1 seconds. Moreover, CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4.
comment: ICLR 2024. Project page: https://sihyun.me/CMD
☆ Learning Decomposable and Debiased Representations via Attribute-Centric Information Bottlenecks
Biased attributes, spuriously correlated with target labels in a dataset, can problematically lead to neural networks that learn improper shortcuts for classifications and limit their capabilities for out-of-distribution (OOD) generalization. Although many debiasing approaches have been proposed to ensure correct predictions from biased datasets, few studies have considered learning latent embedding consisting of intrinsic and biased attributes that contribute to improved performance and explain how the model pays attention to attributes. In this paper, we propose a novel debiasing framework, Debiasing Global Workspace, introducing attention-based information bottlenecks for learning compositional representations of attributes without defining specific bias types. Based on our observation that learning shape-centric representation helps robust performance on OOD datasets, we adopt those abilities to learn robust and generalizable representations of decomposable latent embeddings corresponding to intrinsic and biasing attributes. We conduct comprehensive evaluations on biased datasets, along with both quantitative and qualitative analyses, to showcase our approach's efficacy in attribute-centric representation learning and its ability to differentiate between intrinsic and bias-related features.
comment: 24 pages, 16 figures, 3 tables
☆ Genetic Programming for Explainable Manifold Learning
Manifold learning techniques play a pivotal role in machine learning by revealing lower-dimensional embeddings within high-dimensional data, thus enhancing both the efficiency and interpretability of data analysis by transforming the data into a lower-dimensional representation. However, a notable challenge with current manifold learning methods is their lack of explicit functional mappings, crucial for explainability in many real-world applications. Genetic programming, known for its interpretable functional tree-based models, has emerged as a promising approach to address this challenge. Previous research leveraged multi-objective GP to balance manifold quality against embedding dimensionality, producing functional mappings across a range of embedding sizes. Yet, these mapping trees often became complex, hindering explainability. In response, in this paper, we introduce Genetic Programming for Explainable Manifold Learning (GP-EMaL), a novel approach that directly penalises tree complexity. Our new method is able to maintain high manifold quality while significantly enhancing explainability and also allows customisation of complexity measures, such as symmetry balancing, scaling, and node complexity, catering to diverse application needs. Our experimental analysis demonstrates that GP-EMaL is able to match the performance of the existing approach in most cases, while using simpler, smaller, and more interpretable tree structures. This advancement marks a significant step towards achieving interpretable manifold learning.
☆ Improving Image Classification Accuracy through Complementary Intra-Class and Inter-Class Mixup
MixUp and its variants, such as Manifold MixUp, have two key limitations in image classification tasks. First, they often neglect mixing within the same class (intra-class mixup), leading to an underutilization of the relationships among samples within the same class. Second, although these methods effectively enhance inter-class separability by mixing between different classes (inter-class mixup), they fall short in improving intra-class cohesion through their mixing operations, limiting their classification performance. To tackle these issues, we propose a novel mixup method and a comprehensive integrated solution.Our mixup approach specifically targets intra-class mixup, an aspect commonly overlooked, to strengthen intra-class cohesion-a feature not provided by current mixup techniques.For each mini-batch, our method utilizes feature representations of unaugmented original images from each class within the mini-batch to generate a single synthesized feature representation through random linear interpolation. All synthesized representations for this mini-batch are then fed into the classification and loss layers to calculate an average classification loss that can markedly enhance intra-class cohesion. Moreover, our integrated solution seamlessly combines our intra-class mixup method with an existing mixup approach such as MixUp or Manifold MixUp. This comprehensive solution incorporates inter- and intra-class mixup in a balanced manner while concurrently improving intra-class cohesion and inter-class separability. Experimental results on six public datasets demonstrate that our integrated solution achieves a 0.1% to 3.43% higher accuracy than the best of either MixUp or our intra-class mixup method, averaging a 1.16% gain. It also outperforms the better performer of either Manifold MixUp or our intra-class mixup method by 0.12% to 5.16%, with an average gain of 1.11%.
comment: 25 pages,12 figures
☆ Learning causal graphs using variable grouping according to ancestral relationship
Several causal discovery algorithms have been proposed. However, when the sample size is small relative to the number of variables, the accuracy of estimating causal graphs using existing methods decreases. And some methods are not feasible when the sample size is smaller than the number of variables. To circumvent these problems, some researchers proposed causal structure learning algorithms using divide-and-conquer approaches. For learning the entire causal graph, the approaches first split variables into several subsets according to the conditional independence relationships among the variables, then apply a conventional causal discovery algorithm to each subset and merge the estimated results. Since the divide-and-conquer approach reduces the number of variables to which a causal structure learning algorithm is applied, it is expected to improve the estimation accuracy of causal graphs, especially when the sample size is small relative to the number of variables and the model is sparse. However, existing methods are either computationally expensive or do not provide sufficient accuracy when the sample size is small. This paper proposes a new algorithm for grouping variables based the ancestral relationships among the variables, under the LiNGAM assumption, where the causal relationships are linear, and the mutually independent noise are distributed as continuous non-Gaussian distributions. We call the proposed algorithm CAG. The time complexity of the ancestor finding in CAG is shown to be cubic to the number of variables. Extensive computer experiments confirm that the proposed method outperforms the original DirectLiNGAM without grouping variables and other divide-and-conquer approaches not only in estimation accuracy but also in computation time when the sample size is small relative to the number of variables and the model is sparse.
comment: 12 pages, 5 figures
☆ AI and Memory Wall
The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/2yrs, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
comment: Published in IEEE Micro Journal
☆ Advancing IIoT with Over-the-Air Federated Learning: The Role of Iterative Magnitude Pruning
The industrial Internet of Things (IIoT) under Industry 4.0 heralds an era of interconnected smart devices where data-driven insights and machine learning (ML) fuse to revolutionize manufacturing. A noteworthy development in IIoT is the integration of federated learning (FL), which addresses data privacy and security among devices. FL enables edge sensors, also known as peripheral intelligence units (PIUs) to learn and adapt using their data locally, without explicit sharing of confidential data, to facilitate a collaborative yet confidential learning process. However, the lower memory footprint and computational power of PIUs inherently require deep neural network (DNN) models that have a very compact size. Model compression techniques such as pruning can be used to reduce the size of DNN models by removing unnecessary connections that have little impact on the model's performance, thus making the models more suitable for the limited resources of PIUs. Targeting the notion of compact yet robust DNN models, we propose the integration of iterative magnitude pruning (IMP) of the DNN model being trained in an over-the-air FL (OTA-FL) environment for IIoT. We provide a tutorial overview and also present a case study of the effectiveness of IMP in OTA-FL for an IIoT environment. Finally, we present future directions for enhancing and optimizing these deep compression techniques further, aiming to push the boundaries of IIoT capabilities in acquiring compact yet robust and high-performing DNN models.
comment: 6 pages, 6 figures
☆ C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion ICLR 2024
In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration-a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data.
comment: ICLR 2024
☆ HETAL: Efficient Privacy-preserving Transfer Learning with Homomorphic Encryption ICML 2023
Transfer learning is a de facto standard method for efficiently training machine learning models for data-scarce problems by adding and fine-tuning new classification layers to a model pre-trained on large datasets. Although numerous previous studies proposed to use homomorphic encryption to resolve the data privacy issue in transfer learning in the machine learning as a service setting, most of them only focused on encrypted inference. In this study, we present HETAL, an efficient Homomorphic Encryption based Transfer Learning algorithm, that protects the client's privacy in training tasks by encrypting the client data using the CKKS homomorphic encryption scheme. HETAL is the first practical scheme that strictly provides encrypted training, adopting validation-based early stopping and achieving the accuracy of nonencrypted training. We propose an efficient encrypted matrix multiplication algorithm, which is 1.8 to 323 times faster than prior methods, and a highly precise softmax approximation algorithm with increased coverage. The experimental results for five well-known benchmark datasets show total training times of 567-3442 seconds, which is less than an hour.
comment: ICML 2023, Appendix D includes some updates after official publication
☆ Heuristic Algorithm-based Action Masking Reinforcement Learning (HAAM-RL) with Ensemble Inference Method
This paper presents a novel reinforcement learning (RL) approach called HAAM-RL (Heuristic Algorithm-based Action Masking Reinforcement Learning) for optimizing the color batching re-sequencing problem in automobile painting processes. The existing heuristic algorithms have limitations in adequately reflecting real-world constraints and accurately predicting logistics performance. Our methodology incorporates several key techniques including a tailored Markov Decision Process (MDP) formulation, reward setting including Potential-Based Reward Shaping, action masking using heuristic algorithms (HAAM-RL), and an ensemble inference method that combines multiple RL models. The RL agent is trained and evaluated using FlexSim, a commercial 3D simulation software, integrated with our RL MLOps platform BakingSoDA. Experimental results across 30 scenarios demonstrate that HAAM-RL with an ensemble inference method achieves a 16.25% performance improvement over the conventional heuristic algorithm, with stable and consistent results. The proposed approach exhibits superior performance and generalization capability, indicating its effectiveness in optimizing complex manufacturing processes. The study also discusses future research directions, including alternative state representations, incorporating model-based RL methods, and integrating additional real-world constraints.
comment: 7 pages, 8 figures
☆ DouRN: Improving DouZero by Residual Neural Networks
Deep reinforcement learning has made significant progress in games with imperfect information, but its performance in the card game Doudizhu (Chinese Poker/Fight the Landlord) remains unsatisfactory. Doudizhu is different from conventional games as it involves three players and combines elements of cooperation and confrontation, resulting in a large state and action space. In 2021, a Doudizhu program called DouZero\cite{zha2021douzero} surpassed previous models without prior knowledge by utilizing traditional Monte Carlo methods and multilayer perceptrons. Building on this work, our study incorporates residual networks into the model, explores different architectural designs, and conducts multi-role testing. Our findings demonstrate that this model significantly improves the winning rate within the same training time. Additionally, we introduce a call scoring system to assist the agent in deciding whether to become a landlord. With these enhancements, our model consistently outperforms the existing version of DouZero and even experienced human players. \footnote{The source code is available at \url{https://github.com/Yingchaol/Douzero_Resnet.git.}
☆ Text-Enhanced Data-free Approach for Federated Class-Incremental Learning CVPR 2024
Federated Class-Incremental Learning (FCIL) is an underexplored yet pivotal issue, involving the dynamic addition of new classes in the context of federated learning. In this field, Data-Free Knowledge Transfer (DFKT) plays a crucial role in addressing catastrophic forgetting and data privacy problems. However, prior approaches lack the crucial synergy between DFKT and the model training phases, causing DFKT to encounter difficulties in generating high-quality data from a non-anchored latent space of the old task model. In this paper, we introduce LANDER (Label Text Centered Data-Free Knowledge Transfer) to address this issue by utilizing label text embeddings (LTE) produced by pretrained language models. Specifically, during the model training phase, our approach treats LTE as anchor points and constrains the feature embeddings of corresponding training samples around them, enriching the surrounding area with more meaningful information. In the DFKT phase, by using these LTE anchors, LANDER can synthesize more meaningful samples, thereby effectively addressing the forgetting problem. Additionally, instead of tightly constraining embeddings toward the anchor, the Bounding Loss is introduced to encourage sample embeddings to remain flexible within a defined radius. This approach preserves the natural differences in sample embeddings and mitigates the embedding overlap caused by heterogeneous federated settings. Extensive experiments conducted on CIFAR100, Tiny-ImageNet, and ImageNet demonstrate that LANDER significantly outperforms previous methods and achieves state-of-the-art performance in FCIL. The code is available at https://github.com/tmtuan1307/lander.
comment: Accepted at CVPR 2024
☆ Carbon Footprint Reduction for Sustainable Data Centers in Real-Time
As machine learning workloads significantly increase energy consumption, sustainable data centers with low carbon emissions are becoming a top priority for governments and corporations worldwide. This requires a paradigm shift in optimizing power consumption in cooling and IT loads, shifting flexible loads based on the availability of renewable energy in the power grid, and leveraging battery storage from the uninterrupted power supply in data centers, using collaborative agents. The complex association between these optimization strategies and their dependencies on variable external factors like weather and the power grid carbon intensity makes this a hard problem. Currently, a real-time controller to optimize all these goals simultaneously in a dynamic real-world setting is lacking. We propose a Data Center Carbon Footprint Reduction (DC-CFR) multi-agent Reinforcement Learning (MARL) framework that optimizes data centers for the multiple objectives of carbon footprint reduction, energy consumption, and energy cost. The results show that the DC-CFR MARL agents effectively resolved the complex interdependencies in optimizing cooling, load shifting, and energy storage in real-time for various locations under real-world dynamic weather and grid carbon intensity conditions. DC-CFR significantly outperformed the industry standard ASHRAE controller with a considerable reduction in carbon emissions (14.5%), energy usage (14.4%), and energy cost (13.7%) when evaluated over one year across multiple geographical regions.
☆ Protein Conformation Generation via Force-Guided SE(3) Diffusion Models
The conformational landscape of proteins is crucial to understanding their functionality in complex biological processes. Traditional physics-based computational methods, such as molecular dynamics (MD) simulations, suffer from rare event sampling and long equilibration time problems, hindering their applications in general protein systems. Recently, deep generative modeling techniques, especially diffusion models, have been employed to generate novel protein conformations. However, existing score-based diffusion methods cannot properly incorporate important physical prior knowledge to guide the generation process, causing large deviations in the sampled protein conformations from the equilibrium distribution. In this paper, to overcome these limitations, we propose a force-guided SE(3) diffusion model, ConfDiff, for protein conformation generation. By incorporating a force-guided network with a mixture of data-based score models, ConfDiff can can generate protein conformations with rich diversity while preserving high fidelity. Experiments on a variety of protein conformation prediction tasks, including 12 fast-folding proteins and the Bovine Pancreatic Trypsin Inhibitor (BPTI), demonstrate that our method surpasses the state-of-the-art method.
☆ Learning-based Multi-continuum Model for Multiscale Flow Problems
Multiscale problems can usually be approximated through numerical homogenization by an equation with some effective parameters that can capture the macroscopic behavior of the original system on the coarse grid to speed up the simulation. However, this approach usually assumes scale separation and that the heterogeneity of the solution can be approximated by the solution average in each coarse block. For complex multiscale problems, the computed single effective properties/continuum might be inadequate. In this paper, we propose a novel learning-based multi-continuum model to enrich the homogenized equation and improve the accuracy of the single continuum model for multiscale problems with some given data. Without loss of generalization, we consider a two-continuum case. The first flow equation keeps the information of the original homogenized equation with an additional interaction term. The second continuum is newly introduced, and the effective permeability in the second flow equation is determined by a neural network. The interaction term between the two continua aligns with that used in the Dual-porosity model but with a learnable coefficient determined by another neural network. The new model with neural network terms is then optimized using trusted data. We discuss both direct back-propagation and the adjoint method for the PDE-constraint optimization problem. Our proposed learning-based multi-continuum model can resolve multiple interacted media within each coarse grid block and describe the mass transfer among them, and it has been demonstrated to significantly improve the simulation results through numerical experiments involving both linear and nonlinear flow equations.
☆ emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition
Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a potential solution for automatically determining the best DL model. The Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports the selection of CNN and LSTM coupling to improve performance. While DARTS has previously been used to choose CNN and LSTM operations independently, our technique adds a novel mechanism for selecting CNN and SeqNN operations in conjunction using DARTS. Unlike earlier work, we do not impose limits on the layer order of the CNN. Instead, we let DARTS choose the best layer order inside the DARTS cell. We demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM by evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets.
comment: Submitted to IEEE Transactions on Affective Computing on February 19, 2024. arXiv admin note: text overlap with arXiv:2305.14402
☆ Improving $Λ$ Signal Extraction with Domain Adaptation via Normalizing Flows SP
The present study presents a novel application for normalizing flows for domain adaptation. The study investigates the ability of flow based neural networks to improve signal extraction of $\Lambda$ Hyperons at CLAS12. Normalizing Flows can help model complex probability density functions that describe physics processes, enabling uses such as event generation. $\Lambda$ signal extraction has been improved through the use of classifier networks, but differences in simulation and data domains limit classifier performance; this study utilizes the flows for domain adaptation between Monte Carlo simulation and data. We were successful in training a flow network to transform between the latent physics space and a normal distribution. We also found that applying the flows lessened the dependence of the figure of merit on the cut on the classifier output, meaning that there was a broader range where the cut results in a similar figure of merit.
comment: Proceedings for the 25th International Spin Physics Symposium (SPIN 2023)
☆ M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval LREC
In recent research, contrastive learning has proven to be a highly effective method for representation learning and is widely used for dense retrieval. However, we identify that relying solely on contrastive learning can lead to suboptimal retrieval performance. On the other hand, despite many retrieval datasets supporting various learning objectives beyond contrastive learning, combining them efficiently in multi-task learning scenarios can be challenging. In this paper, we introduce M3, an advanced recursive Multi-hop dense sentence retrieval system built upon a novel Multi-task Mixed-objective approach for dense text representation learning, addressing the aforementioned challenges. Our approach yields state-of-the-art performance on a large-scale open-domain fact verification benchmark dataset, FEVER. Code and data are available at: https://github.com/TonyBY/M3
comment: Accepted by LREC-COLING 2024
☆ Sampling Audit Evidence Using a Naive Bayes Classifier
Taiwan's auditors have suffered from processing excessive audit data, including drawing audit evidence. This study advances sampling techniques by integrating machine learning with sampling. This machine learning integration helps avoid sampling bias, keep randomness and variability, and target risker samples. We first classify data using a Naive Bayes classifier into some classes. Next, a user-based, item-based, or hybrid approach is employed to draw audit evidence. The representativeness index is the primary metric for measuring its representativeness. The user-based approach samples data symmetric around the median of a class as audit evidence. It may be equivalent to a combination of monetary and variable samplings. The item-based approach represents asymmetric sampling based on posterior probabilities for obtaining risky samples as audit evidence. It may be identical to a combination of non-statistical and monetary samplings. Auditors can hybridize those user-based and item-based approaches to balance representativeness and riskiness in selecting audit evidence. Three experiments show that sampling using machine learning integration has the benefits of drawing unbiased samples, handling complex patterns, correlations, and unstructured data, and improving efficiency in sampling big data. However, the limitations are the classification accuracy output by machine learning algorithms and the range of prior probabilities.
comment: 16 pages, 11 figures, 4 tables
☆ Automatic Outlier Rectification via Optimal Transport
In this paper, we propose a novel conceptual framework to detect outliers using optimal transport with a concave cost function. Conventional outlier detection approaches typically use a two-stage procedure: first, outliers are detected and removed, and then estimation is performed on the cleaned data. However, this approach does not inform outlier removal with the estimation task, leaving room for improvement. To address this limitation, we propose an automatic outlier rectification mechanism that integrates rectification and estimation within a joint optimization framework. We take the first step to utilize an optimal transport distance with a concave cost function to construct a rectification set in the space of probability distributions. Then, we select the best distribution within the rectification set to perform the estimation task. Notably, the concave cost function we introduced in this paper is the key to making our estimator effectively identify the outlier during the optimization process. We discuss the fundamental differences between our estimator and optimal transport-based distributionally robust optimization estimator. finally, we demonstrate the effectiveness and superiority of our approach over conventional approaches in extensive simulation and empirical analyses for mean estimation, least absolute regression, and the fitting of option implied volatility surfaces.
☆ WeatherProof: Leveraging Language Guidance for Semantic Segmentation in Adverse Weather
We propose a method to infer semantic segmentation maps from images captured under adverse weather conditions. We begin by examining existing models on images degraded by weather conditions such as rain, fog, or snow, and found that they exhibit a large performance drop as compared to those captured under clear weather. To control for changes in scene structures, we propose WeatherProof, the first semantic segmentation dataset with accurate clear and adverse weather image pairs that share an underlying scene. Through this dataset, we analyze the error modes in existing models and found that they were sensitive to the highly complex combination of different weather effects induced on the image during capture. To improve robustness, we propose a way to use language as guidance by identifying contributions of adverse weather conditions and injecting that as "side information". Models trained using our language guidance exhibit performance gains by up to 10.2% in mIoU on WeatherProof, up to 8.44% in mIoU on the widely used ACDC dataset compared to standard training techniques, and up to 6.21% in mIoU on the ACDC dataset as compared to previous SOTA methods.
comment: arXiv admin note: substantial text overlap with arXiv:2312.09534
☆ VidLA: Video-Language Alignment at Scale CVPR 2024
In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome it, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets which only contain short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.
comment: Accepted to CVPR 2024
☆ Distribution-informed and wavelength-flexible data-driven photoacoustic oximetry
Significance: Photoacoustic imaging (PAI) promises to measure spatially-resolved blood oxygen saturation, but suffers from a lack of accurate and robust spectral unmixing methods to deliver on this promise. Accurate blood oxygenation estimation could have important clinical applications, from cancer detection to quantifying inflammation. Aim: This study addresses the inflexibility of existing data-driven methods for estimating blood oxygenation in PAI by introducing a recurrent neural network architecture. Approach: We created 25 simulated training dataset variations to assess neural network performance. We used a long short-term memory network to implement a wavelength-flexible network architecture and proposed the Jensen-Shannon divergence to predict the most suitable training dataset. Results: The network architecture can handle arbitrary input wavelengths and outperforms linear unmixing and the previously proposed learned spectral decolouring method. Small changes in the training data significantly affect the accuracy of our method, but we find that the Jensen-Shannon divergence correlates with the estimation error and is thus suitable for predicting the most appropriate training datasets for any given application. Conclusions: A flexible data-driven network architecture combined with the Jensen-Shannon Divergence to predict the best training data set provides a promising direction that might enable robust data-driven photoacoustic oximetry for clinical use cases.
comment: 37 pages, 7 figures
☆ Robust Model Based Reinforcement Learning Using $\mathcal{L}_1$ Adaptive Control
We introduce $\mathcal{L}_1$-MBRL, a control-theoretic augmentation scheme for Model-Based Reinforcement Learning (MBRL) algorithms. Unlike model-free approaches, MBRL algorithms learn a model of the transition function using data and use it to design a control input. Our approach generates a series of approximate control-affine models of the learned transition function according to the proposed switching law. Using the approximate model, control input produced by the underlying MBRL is perturbed by the $\mathcal{L}_1$ adaptive control, which is designed to enhance the robustness of the system against uncertainties. Importantly, this approach is agnostic to the choice of MBRL algorithm, enabling the use of the scheme with various MBRL algorithms. MBRL algorithms with $\mathcal{L}_1$ augmentation exhibit enhanced performance and sample efficiency across multiple MuJoCo environments, outperforming the original MBRL algorithms, both with and without system noise.
☆ iSpLib: A Library for Accelerating Graph Neural Networks using Auto-tuned Sparse Operations
Core computations in Graph Neural Network (GNN) training and inference are often mapped to sparse matrix operations such as sparse-dense matrix multiplication (SpMM). These sparse operations are harder to optimize by manual tuning because their performance depends significantly on the sparsity of input graphs, GNN models, and computing platforms. To address this challenge, we present iSpLib, a PyTorch-based C++ library equipped with auto-tuned sparse operations. iSpLib expedites GNN training with a cache-enabled backpropagation that stores intermediate matrices in local caches. The library offers a user-friendly Python plug-in that allows users to take advantage of our optimized PyTorch operations out-of-the-box for any existing linear algebra-based PyTorch implementation of popular GNNs (Graph Convolution Network, GraphSAGE, Graph Inference Network, etc.) with only two lines of additional code. We demonstrate that iSpLib obtains up to 27x overall training speedup compared to the equivalent PyTorch 2.1.0 and PyTorch Geometric 2.4.0 implementations on the CPU. Our library is publicly available at https://github.com/HipGraph/iSpLib (https://doi.org/10.5281/zenodo.10806511).
☆ Output-Constrained Lossy Source Coding With Application to Rate-Distortion-Perception Theory
The distortion-rate function of output-constrained lossy source coding with limited common randomness is analyzed for the special case of squared error distortion measure. An explicit expression is obtained when both source and reconstruction distributions are Gaussian. This further leads to a partial characterization of the information-theoretic limit of quadratic Gaussian rate-distortion-perception coding with the perception measure given by Kullback-Leibler divergence or squared quadratic Wasserstein distance.
☆ Learning WENO for entropy stable schemes to solve conservation laws
Entropy conditions play a crucial role in the extraction of a physically relevant solution for a system of conservation laws, thus motivating the construction of entropy stable schemes that satisfy a discrete analogue of such conditions. TeCNO schemes (Fjordholm et al. 2012) form a class of arbitrary high-order entropy stable finite difference solvers, which require specialized reconstruction algorithms satisfying the sign property at each cell interface. Recently, third-order WENO schemes called SP-WENO (Fjordholm and Ray, 2016) and SP-WENOc (Ray, 2018) have been designed to satisfy the sign property. However, these WENO algorithms can perform poorly near shocks, with the numerical solutions exhibiting large spurious oscillations. In the present work, we propose a variant of the SP-WENO, termed as Deep Sign-Preserving WENO (DSP-WENO), where a neural network is trained to learn the WENO weighting strategy. The sign property and third-order accuracy are strongly imposed in the algorithm, which constrains the WENO weight selection region to a convex polygon. Thereafter, a neural network is trained to select the WENO weights from this convex region with the goal of improving the shock-capturing capabilities without sacrificing the rate of convergence in smooth regions. The proposed synergistic approach retains the mathematical framework of the TeCNO scheme while integrating deep learning to remedy the computational issues of the WENO-based reconstruction. We present several numerical experiments to demonstrate the significant improvement with DSP-WENO over the existing variants of WENO satisfying the sign property.
comment: 38 pages, 16 figures, 5 tables
☆ Local Causal Discovery with Linear non-Gaussian Cyclic Models AISTATS 2024
Local causal discovery is of great practical significance, as there are often situations where the discovery of the global causal structure is unnecessary, and the interest lies solely on a single target variable. Most existing local methods utilize conditional independence relations, providing only a partially directed graph, and assume acyclicity for the ground-truth structure, even though real-world scenarios often involve cycles like feedback mechanisms. In this work, we present a general, unified local causal discovery method with linear non-Gaussian models, whether they are cyclic or acyclic. We extend the application of independent component analysis from the global context to independent subspace analysis, enabling the exact identification of the equivalent local directed structures and causal strengths from the Markov blanket of the target variable. We also propose an alternative regression-based method in the particular acyclic scenarios. Our identifiability results are empirically validated using both synthetic and real-world datasets.
comment: Appears at AISTATS 2024
☆ Model order reduction of deep structured state-space models: A system-theoretic approach
With a specific emphasis on control design objectives, achieving accurate system modeling with limited complexity is crucial in parametric system identification. The recently introduced deep structured state-space models (SSM), which feature linear dynamical blocks as key constituent components, offer high predictive performance. However, the learned representations often suffer from excessively large model orders, which render them unsuitable for control design purposes. The current paper addresses this challenge by means of system-theoretic model order reduction techniques that target the linear dynamical blocks of SSMs. We introduce two regularization terms which can be incorporated into the training loss for improved model order reduction. In particular, we consider modal $\ell_1$ and Hankel nuclear norm regularization to promote sparsity, allowing one to retain only the relevant states without sacrificing accuracy. The presented regularizers lead to advantages in terms of parsimonious representations and faster inference resulting from the reduced order models. The effectiveness of the proposed methodology is demonstrated using real-world ground vibration data from an aircraft.
☆ Deep Clustering Evaluation: How to Validate Internal Clustering Validation Measures
Deep clustering, a method for partitioning complex, high-dimensional data using deep neural networks, presents unique evaluation challenges. Traditional clustering validation measures, designed for low-dimensional spaces, are problematic for deep clustering, which involves projecting data into lower-dimensional embeddings before partitioning. Two key issues are identified: 1) the curse of dimensionality when applying these measures to raw data, and 2) the unreliable comparison of clustering results across different embedding spaces stemming from variations in training procedures and parameter settings in different clustering models. This paper addresses these challenges in evaluating clustering quality in deep learning. We present a theoretical framework to highlight ineffectiveness arising from using internal validation measures on raw and embedded data and propose a systematic approach to applying clustering validity indices in deep clustering contexts. Experiments show that this framework aligns better with external validation measures, effectively reducing the misguidance from the improper use of clustering validity indices in deep learning.
♻ ☆ TD-MPC2: Scalable, Robust World Models for Continuous Control ICLR 2024
TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. Explore videos, models, data, code, and more at https://tdmpc2.com
comment: ICLR 2024. Explore videos, models, data, code, and more at https://tdmpc2.com
♻ ☆ Emergent Dominance Hierarchies in Reinforcement Learning Agents
Modern Reinforcement Learning (RL) algorithms are able to outperform humans in a wide variety of tasks. Multi-agent reinforcement learning (MARL) settings present additional challenges, and successful cooperation in mixed-motive groups of agents depends on a delicate balancing act between individual and group objectives. Social conventions and norms, often inspired by human institutions, are used as tools for striking this balance. In this paper, we examine a fundamental, well-studied social convention that underlies cooperation in both animal and human societies: dominance hierarchies. We adapt the ethological theory of dominance hierarchies to artificial agents, borrowing the established terminology and definitions with as few amendments as possible. We demonstrate that populations of RL agents, operating without explicit programming or intrinsic rewards, can invent, learn, enforce, and transmit a dominance hierarchy to new populations. The dominance hierarchies that emerge have a similar structure to those studied in chickens, mice, fish, and other species.
♻ ☆ Unraveling the Mystery of Scaling Laws: Part I
Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training. These principles play a vital role in optimizing various aspects of model pre-training, ultimately contributing to the success of large language models such as GPT-4, Llama and Gemini. However, the original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas, and their conclusions are only based on models containing up to 1.5 billion parameters. Though some subsequent works attempt to unveil these details and scale to larger models, they often neglect the training dependency of important factors such as the learning rate, context length and batch size, leading to their failure to establish a reliable formula for predicting the test loss trajectory. In this technical report, we confirm that the scaling law formulations proposed in the original OpenAI paper remain valid when scaling the model size up to 33 billion, but the constant coefficients in these formulas vary significantly with the experiment setup. We meticulously identify influential factors and provide transparent, step-by-step instructions to estimate all constant terms in scaling-law formulas by training on models with only 1M~60M parameters. Using these estimated formulas, we showcase the capability to accurately predict various attributes for models with up to 33B parameters before their training, including (1) the minimum possible test loss; (2) the minimum required training steps and processed tokens to achieve a specific loss; (3) the critical batch size with an optimal time/computation trade-off at any loss value; and (4) the complete test loss trajectory with arbitrary batch size.
♻ ☆ A Geospatial Approach to Predicting Desert Locust Breeding Grounds in Africa
Desert locust swarms present a major threat to agriculture and food security. Addressing this challenge, our study develops an operationally-ready model for predicting locust breeding grounds, which has the potential to enhance early warning systems and targeted control measures. We curated a dataset from the United Nations Food and Agriculture Organization's (UN-FAO) locust observation records and analyzed it using two types of spatio-temporal input features: remotely-sensed environmental and climate data as well as multi-spectral earth observation images. Our approach employed custom deep learning models (three-dimensional and LSTM-based recurrent convolutional networks), along with the geospatial foundational model Prithvi recently released by Jakubik et al., 2023. These models notably outperformed existing baselines, with the Prithvi-based model, fine-tuned on multi-spectral images from NASA's Harmonized Landsat and Sentinel-2 (HLS) dataset, achieving the highest accuracy, F1 and ROC-AUC scores (83.03%, 81.53% and 87.69%, respectively). A significant finding from our research is that multi-spectral earth observation images alone are sufficient for effective locust breeding ground prediction without the need to explicitly incorporate climatic or environmental features.
♻ ☆ MedMamba: Vision Mamba for Medical Image Classification
Medical image classification is a very fundamental and crucial task in the field of computer vision. These years, CNN-based and Transformer-based models have been widely used to classify various medical images. Unfortunately, The limitation of CNNs in long-range modeling capabilities prevents them from effectively extracting features in medical images, while Transformers are hampered by their quadratic computational complexity. Recent research has shown that the state space model (SSM) represented by Mamba can efficiently model long-range interactions while maintaining linear computational complexity. Inspired by this, we propose Vision Mamba for medical image classification (MedMamba). More specifically, we introduce a novel Conv-SSM module. Conv-SSM combines the local feature extraction ability of convolutional layers with the ability of SSM to capture long-range dependency, thereby modeling medical images with different modalities. To demonstrate the potential of MedMamba, we conducted extensive experiments using 14 publicly available medical datasets with different imaging techniques and two private datasets built by ourselves. Extensive experimental results demonstrate that the proposed MedMamba performs well in detecting lesions in various medical images. To the best of our knowledge, this is the first Vision Mamba tailored for medical image classification. The purpose of this work is to establish a new baseline for medical image classification tasks and provide valuable insights for the future development of more efficient and effective SSM-based artificial intelligence algorithms and application systems in the medical. Source code has been available at https://github.com/YubiaoYue/MedMamba.
♻ ☆ Let's do the time-warp-attend: Learning topological invariants of dynamical systems
Dynamical systems across the sciences, from electrical circuits to ecological networks, undergo qualitative and often catastrophic changes in behavior, called bifurcations, when their underlying parameters cross a threshold. Existing methods predict oncoming catastrophes in individual systems but are primarily time-series-based and struggle both to categorize qualitative dynamical regimes across diverse systems and to generalize to real data. To address this challenge, we propose a data-driven, physically-informed deep-learning framework for classifying dynamical regimes and characterizing bifurcation boundaries based on the extraction of topologically invariant features. We focus on the paradigmatic case of the supercritical Hopf bifurcation, which is used to model periodic dynamics across a wide range of applications. Our convolutional attention method is trained with data augmentations that encourage the learning of topological invariants which can be used to detect bifurcation boundaries in unseen systems and to design models of biological systems like oscillatory gene regulatory networks. We further demonstrate our method's use in analyzing real data by recovering distinct proliferation and differentiation dynamics along pancreatic endocrinogenesis trajectory in gene expression space based on single-cell data. Our method provides valuable insights into the qualitative, long-term behavior of a wide range of dynamical systems, and can detect bifurcations or catastrophic transitions in large-scale physical and biological systems.
♻ ☆ QuATON: Quantization Aware Training of Optical Neurons
Optical processors, built with "optical neurons", can efficiently perform high-dimensional linear operations at the speed of light. Thus they are a promising avenue to accelerate large-scale linear computations. With the current advances in micro-fabrication, such optical processors can now be 3D fabricated, but with a limited precision. This limitation translates to quantization of learnable parameters in optical neurons, and should be handled during the design of the optical processor in order to avoid a model mismatch. Specifically, optical neurons should be trained or designed within the physical-constraints at a predefined quantized precision level. To address this critical issues we propose a physics-informed quantization-aware training framework. Our approach accounts for physical constraints during the training process, leading to robust designs. We demonstrate that our approach can design state of the art optical processors using diffractive networks for multiple physics based tasks despite quantized learnable parameters. We thus lay the foundation upon which improved optical processors may be 3D fabricated in the future.
♻ ☆ Collaborative Distributed Machine Learning
Various collaborative distributed machine learning (CDML) systems, including federated learning systems and swarm learning systems, with different key traits were developed to leverage resources for development and use of machine learning (ML) models in a confidentiality-preserving way. To meet use case requirements, suitable CDML systems need to be selected. However, comparison between CDML systems regarding their suitability for use cases is often difficult. This work presents a CDML system conceptualization and CDML archetypes to support comparison of CDML systems and introduce scientific and practical audiences to the principal functioning and key traits of CDML systems.
♻ ☆ Assessing the Causal Impact of Humanitarian Aid on Food Security
In the face of climate change-induced droughts, vulnerable regions encounter severe threats to food security, demanding urgent humanitarian assistance. This paper introduces a causal inference framework for the Horn of Africa, aiming to assess the impact of cash-based interventions on food crises. Our contributions include identifying causal relationships within the food security system, harmonizing a comprehensive database including socio-economic, weather and remote sensing data, and estimating the causal effect of humanitarian interventions on malnutrition. On a country level, our results revealed no significant effects, likely due to limited sample size, suboptimal data quality, and an imperfect causal graph resulting from our limited understanding of multidisciplinary systems like food security. Instead, on a district level, results revealed significant effects, further implying the context-specific nature of the system. This underscores the need to enhance data collection and refine causal models with domain experts for more effective future interventions and policies, improving transparency and accountability in humanitarian aid.
comment: Accepted for publication and presentation at the International Geoscience and Remote Sensing Symposium (IGARSS) 2024
♻ ☆ Learning a Depth Covariance Function CVPR 2023
We propose learning a depth covariance function with applications to geometric vision tasks. Given RGB images as input, the covariance function can be flexibly used to define priors over depth functions, predictive distributions given observations, and methods for active point selection. We leverage these techniques for a selection of downstream tasks: depth completion, bundle adjustment, and monocular dense visual odometry.
comment: CVPR 2023. Project page: https://edexheim.github.io/DepthCov/
♻ ☆ TMI! Finetuned Models Leak Private Information from their Pretraining Data
Transfer learning has become an increasingly popular technique in machine learning as a way to leverage a pretrained model trained for one task to assist with building a finetuned model for a related task. This paradigm has been especially popular for $\textit{privacy}$ in machine learning, where the pretrained model is considered public, and only the data for finetuning is considered sensitive. However, there are reasons to believe that the data used for pretraining is still sensitive, making it essential to understand how much information the finetuned model leaks about the pretraining data. In this work we propose a new membership-inference threat model where the adversary only has access to the finetuned model and would like to infer the membership of the pretraining data. To realize this threat model, we implement a novel metaclassifier-based attack, $\textbf{TMI}$, that leverages the influence of memorized pretraining samples on predictions in the downstream task. We evaluate $\textbf{TMI}$ on both vision and natural language tasks across multiple transfer learning settings, including finetuning with differential privacy. Through our evaluation, we find that $\textbf{TMI}$ can successfully infer membership of pretraining examples using query access to the finetuned model. An open-source implementation of $\textbf{TMI}$ can be found $\href{https://github.com/johnmath/tmi-pets24}{\text{on GitHub}}$.
♻ ☆ Closing the Gap: Achieving Better Accuracy-Robustness Tradeoffs against Query-Based Attacks AAAI
Although promising, existing defenses against query-based attacks share a common limitation: they offer increased robustness against attacks at the price of a considerable accuracy drop on clean samples. In this work, we show how to efficiently establish, at test-time, a solid tradeoff between robustness and accuracy when mitigating query-based attacks. Given that these attacks necessarily explore low-confidence regions, our insight is that activating dedicated defenses, such as random noise defense and random image transformations, only for low-confidence inputs is sufficient to prevent them. Our approach is independent of training and supported by theory. We verify the effectiveness of our approach for various existing defenses by conducting extensive experiments on CIFAR-10, CIFAR-100, and ImageNet. Our results confirm that our proposal can indeed enhance these defenses by providing better tradeoffs between robustness and accuracy when compared to state-of-the-art approaches while being completely training-free.
comment: To appear in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2024
♻ ☆ EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models
In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard open-source instruction processing implementation framework available for the community, which hinders practitioners from further developing and advancing. To facilitate instruction processing research and development, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. EasyInstruct is publicly released and actively maintained at https://github.com/zjunlp/EasyInstruct, along with an online demo app and a demo video for quick-start, calling for broader research centered on instruction data and synthetic data.
comment: Project website: https://zjunlp.github.io/project/EasyInstruct Code: https://github.com/zjunlp/EasyInstruct Video: https://youtu.be/rfQOWYfziFo Demo: https://huggingface.co/spaces/zjunlp/EasyInstruct
♻ ☆ Exact and general decoupled solutions of the LMC Multitask Gaussian Process model UAI
The Linear Model of Co-regionalization (LMC) is a very general model of multitask gaussian process for regression or classification. While its expressivity and conceptual simplicity are appealing, naive implementations have cubic complexity in the number of datapoints and number of tasks, making approximations mandatory for most applications. However, recent work has shown that under some conditions the latent processes of the model can be decoupled, leading to a complexity that is only linear in the number of said processes. We here extend these results, showing from the most general assumptions that the only condition necessary to an efficient exact computation of the LMC is a mild hypothesis on the noise model. We introduce a full parametrization of the resulting \emph{projected LMC} model, and an expression of the marginal likelihood enabling efficient optimization. We perform a parametric study on synthetic data to show the excellent performance of our approach, compared to an unrestricted exact LMC and approximations of the latter. Overall, the projected LMC appears as a credible and simpler alternative to state-of-the art models, which greatly facilitates some computations such as leave-one-out cross-validation and fantasization.
comment: 29 pages, 10 figures, submitted to UAI
♻ ☆ On the Privacy of Selection Mechanisms with Gaussian Noise AISTATS 2024
Report Noisy Max and Above Threshold are two classical differentially private (DP) selection mechanisms. Their output is obtained by adding noise to a sequence of low-sensitivity queries and reporting the identity of the query whose (noisy) answer satisfies a certain condition. Pure DP guarantees for these mechanisms are easy to obtain when Laplace noise is added to the queries. On the other hand, when instantiated using Gaussian noise, standard analyses only yield approximate DP guarantees despite the fact that the outputs of these mechanisms lie in a discrete space. In this work, we revisit the analysis of Report Noisy Max and Above Threshold with Gaussian noise and show that, under the additional assumption that the underlying queries are bounded, it is possible to provide pure ex-ante DP bounds for Report Noisy Max and pure ex-post DP bounds for Above Threshold. The resulting bounds are tight and depend on closed-form expressions that can be numerically evaluated using standard methods. Empirically we find these lead to tighter privacy accounting in the high privacy, low data regime. Further, we propose a simple privacy filter for composing pure ex-post DP guarantees, and use it to derive a fully adaptive Gaussian Sparse Vector Technique mechanism. Finally, we provide experiments on mobility and energy consumption datasets demonstrating that our Sparse Vector Technique is practically competitive with previous approaches and requires less hyper-parameter tuning.
comment: AISTATS 2024
♻ ☆ Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training CVPR 2024
In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data, using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results.
comment: Accepted at CVPR 2024. 13 pages, 4 figures
♻ ☆ AI-KD: Adversarial learning and Implicit regularization for self-Knowledge Distillation
We present a novel adversarial penalized self-knowledge distillation method, named adversarial learning and implicit regularization for self-knowledge distillation (AI-KD), which regularizes the training procedure by adversarial learning and implicit distillations. Our model not only distills the deterministic and progressive knowledge which are from the pre-trained and previous epoch predictive probabilities but also transfers the knowledge of the deterministic predictive distributions using adversarial learning. The motivation is that the self-knowledge distillation methods regularize the predictive probabilities with soft targets, but the exact distributions may be hard to predict. Our method deploys a discriminator to distinguish the distributions between the pre-trained and student models while the student model is trained to fool the discriminator in the trained procedure. Thus, the student model not only can learn the pre-trained model's predictive probabilities but also align the distributions between the pre-trained and student models. We demonstrate the effectiveness of the proposed method with network architectures on multiple datasets and show the proposed method achieves better performance than state-of-the-art methods.
comment: Accepted to KBS
♻ ☆ Sequence-to-Sequence Spanish Pre-trained Language Models LREC
In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.
comment: Accepted paper at LREC-Coling2024
♻ ☆ Effective Structured Prompting by Meta-Learning and Representative Verbalizer ICML 2023
Prompt tuning for pre-trained masked language models (MLM) has shown promising performance in natural language processing tasks with few labeled examples. It tunes a prompt for the downstream task, and a verbalizer is used to bridge the predicted token and label prediction. Due to the limited training data, prompt initialization is crucial for prompt tuning. Recently, MetaPrompting (Hou et al., 2022) uses meta-learning to learn a shared initialization for all task-specific prompts. However, a single initialization is insufficient to obtain good prompts for all tasks and samples when the tasks are complex. Moreover, MetaPrompting requires tuning the whole MLM, causing a heavy burden on computation and memory as the MLM is usually large. To address these issues, we use a prompt pool to extract more task knowledge and construct instance-dependent prompts via attention. We further propose a novel soft verbalizer (RepVerb) which constructs label embedding from feature embeddings directly. Combining meta-learning the prompt pool and RepVerb, we propose MetaPrompter for effective structured prompting. MetaPrompter is parameter-efficient as only the pool is required to be tuned. Experimental results demonstrate that MetaPrompter performs better than the recent state-of-the-arts and RepVerb outperforms existing soft verbalizers.
comment: Accepted at ICML 2023
♻ ☆ Toward a Theory of Causation for Interpreting Neural Code Models
Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post hoc interpretability method specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and ten NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax. All our NCMs, except for the BERT-like model, statistically learn to predict tokens related to blocks of code (\eg brackets, parenthesis, semicolon) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful method to detect and facilitate the elimination of confounding bias in NCMs.
♻ ☆ Consistency Enhancement-Based Deep Multiview Clustering via Contrastive Learning
Multiview clustering (MVC) segregates data samples into meaningful clusters by synthesizing information across multiple views. Moreover, deep learning-based methods have demonstrated their strong feature learning capabilities in MVC scenarios. However, effectively generalizing feature representations while maintaining consistency is still an intractable problem. In addition, most existing deep clustering methods based on contrastive learning overlook the consistency of the clustering representations during the clustering process. In this paper, we show how the above problems can be overcome and propose a consistent enhancement-based deep MVC method via contrastive learning (CCEC). Specifically, semantic connection blocks are incorporated into a feature representation to preserve the consistent information among multiple views. Furthermore, the representation process for clustering is enhanced through spectral clustering, and the consistency across multiple views is improved. Experiments conducted on five datasets demonstrate the effectiveness and superiority of our method in comparison with the state-of-the-art (SOTA) methods. The code for this method can be accessed at https://anonymous.4open.science/r/CCEC-E84E/.
comment: There are multiple errors that need to be corrected, including some formulas and concept descriptions. We will re upload the paper after the modifications are completed
♻ ☆ Adversarial Attacks and Defenses in Automated Control Systems: A Comprehensive Benchmark
Integrating machine learning into Automated Control Systems (ACS) enhances decision-making in industrial process management. One of the limitations to the widespread adoption of these technologies in industry is the vulnerability of neural networks to adversarial attacks. This study explores the threats in deploying deep learning models for fault diagnosis in ACS using the Tennessee Eastman Process dataset. By evaluating three neural networks with different architectures, we subject them to six types of adversarial attacks and explore five different defense methods. Our results highlight the strong vulnerability of models to adversarial samples and the varying effectiveness of defense strategies. We also propose a novel protection approach by combining multiple defense methods and demonstrate it's efficacy. This research contributes several insights into securing machine learning within ACS, ensuring robust fault diagnosis in industrial processes.
♻ ☆ Learning to Solve Integer Linear Programs with Davis-Yin Splitting
In many applications, a combinatorial problem must be repeatedly solved with similar, but distinct parameters. Yet, the parameters $w$ are not directly observed; only contextual data $d$ that correlates with $w$ is available. It is tempting to use a neural network to predict $w$ given $d$. However, training such a model requires reconciling the discrete nature of combinatorial optimization with the gradient-based frameworks used to train neural networks. When the problem in question is an Integer Linear Program (ILP), one approach to overcome this training issue is to consider a continuous relaxation of the combinatorial problem. While existing methods utilizing this approach have shown to be highly effective on small problems, they do not always scale well to large problems. In this work, we draw on ideas from modern convex optimization to design a network and training scheme which scales effortlessly to problems with thousands of variables. Our experiments verify the computational advantage our proposed method enjoys on two representative problems, namely the shortest path problem and the knapsack problem.
♻ ☆ LLM4SGG: Large Language Model for Weakly Supervised Scene Graph Generation CVPR 2024
Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard, studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However, they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions, where fine-grained predicates in captions are undesirably converted into coarse-grained predicates, resulting in a long-tailed predicate distribution, and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest, where many triplets are discarded and not used in training, leading to insufficient supervision. To tackle the two issues, we propose a new approach, i.e., Large Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes, we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG, we conduct extensive experiments on Visual Genome and GQA datasets, showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient, enabling effective model training with a small amount of training images.
comment: 8 pages; CVPR 2024
♻ ☆ Posterior Sampling Based on Gradient Flows of the MMD with Negative Distance Kernel ICLR 2024
We propose conditional flows of the maximum mean discrepancy (MMD) with the negative distance kernel for posterior sampling and conditional generative modeling. This MMD, which is also known as energy distance, has several advantageous properties like efficient computation via slicing and sorting. We approximate the joint distribution of the ground truth and the observations using discrete Wasserstein gradient flows and establish an error bound for the posterior distributions. Further, we prove that our particle flow is indeed a Wasserstein gradient flow of an appropriate functional. The power of our method is demonstrated by numerical examples including conditional image generation and inverse problems like superresolution, inpainting and computed tomography in low-dose and limited-angle settings.
comment: Published as a conference paper at ICLR 2024
♻ ☆ On the convergence of loss and uncertainty-based active learning algorithms
We consider the convergence rates of loss and uncertainty-based active learning algorithms under various assumptions. Firstly, we establish a set of conditions that ensure convergence rates when applied to linear classifiers and linearly separable datasets. This includes demonstrating convergence rate guarantees for loss-based sampling with various loss functions. Secondly, we introduce a framework that allows us to derive convergence rate bounds for loss-based sampling by leveraging known convergence rate bounds for stochastic gradient descent algorithms. Lastly, we propose a new algorithm that combines point sampling and stochastic Polyak's step size. We establish a condition on the sampling process, ensuring a convergence rate guarantee for this algorithm, particularly in the case of smooth convex loss functions. Our numerical results showcase the efficiency of the proposed algorithm.
♻ ☆ RiskQ: Risk-sensitive Multi-Agent Reinforcement Learning Value Factorization NeurIPS 2023
Multi-agent systems are characterized by environmental uncertainty, varying policies of agents, and partial observability, which result in significant risks. In the context of Multi-Agent Reinforcement Learning (MARL), learning coordinated and decentralized policies that are sensitive to risk is challenging. To formulate the coordination requirements in risk-sensitive MARL, we introduce the Risk-sensitive Individual-Global-Max (RIGM) principle as a generalization of the Individual-Global-Max (IGM) and Distributional IGM (DIGM) principles. This principle requires that the collection of risk-sensitive action selections of each agent should be equivalent to the risk-sensitive action selection of the central policy. Current MARL value factorization methods do not satisfy the RIGM principle for common risk metrics such as the Value at Risk (VaR) metric or distorted risk measurements. Therefore, we propose RiskQ to address this limitation, which models the joint return distribution by modeling quantiles of it as weighted quantile mixtures of per-agent return distribution utilities. RiskQ satisfies the RIGM principle for the VaR and distorted risk metrics. We show that RiskQ can obtain promising performance through extensive experiments. The source code of RiskQ is available in https://github.com/xmu-rl-3dv/RiskQ.
comment: Accepted at NeurIPS 2023
♻ ☆ Neural Wasserstein Gradient Flows for Maximum Mean Discrepancies with Riesz Kernels ICML 2023
Wasserstein gradient flows of maximum mean discrepancy (MMD) functionals with non-smooth Riesz kernels show a rich structure as singular measures can become absolutely continuous ones and conversely. In this paper we contribute to the understanding of such flows. We propose to approximate the backward scheme of Jordan, Kinderlehrer and Otto for computing such Wasserstein gradient flows as well as a forward scheme for so-called Wasserstein steepest descent flows by neural networks (NNs). Since we cannot restrict ourselves to absolutely continuous measures, we have to deal with transport plans and velocity plans instead of usual transport maps and velocity fields. Indeed, we approximate the disintegration of both plans by generative NNs which are learned with respect to appropriate loss functions. In order to evaluate the quality of both neural schemes, we benchmark them on the interaction energy. Here we provide analytic formulas for Wasserstein schemes starting at a Dirac measure and show their convergence as the time step size tends to zero. Finally, we illustrate our neural MMD flows by numerical examples.
comment: Accepted at ICML 2023
♻ ☆ Graph Ranking Contrastive Learning: A Extremely Simple yet Efficient Method
Graph contrastive learning (GCL) has emerged as a representative graph self-supervised method, achieving significant success. The currently prevalent optimization objective for GCL is InfoNCE. Typically, it employs augmentation techniques to obtain two views, where a node in one view acts as the anchor, the corresponding node in the other view serves as the positive sample, and all other nodes are regarded as negative samples. The goal is to minimize the distance between the anchor node and positive samples and maximize the distance to negative samples. However, due to the lack of label information during training, InfoNCE inevitably treats samples from the same class as negative samples, leading to the issue of false negative samples. This can impair the learned node representations and subsequently hinder performance in downstream tasks. While numerous methods have been proposed to mitigate the impact of false negatives, they still face various challenges. For instance, while increasing the number of negative samples can dilute the impact of false negatives, it concurrently increases computational burden. Thus, we propose GraphRank, a simple yet efficient graph contrastive learning method that addresses the problem of false negative samples by redefining the concept of negative samples to a certain extent, thereby avoiding the issue of false negative samples. The effectiveness of GraphRank is empirically validated through experiments on the node, edge, and graph level tasks.
♻ ☆ Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints
Learning to defer (L2D) aims to improve human-AI collaboration systems by learning how to defer decisions to humans when they are more likely to be correct than an ML classifier. Existing research in L2D overlooks key aspects of real-world systems that impede its practical adoption, namely: i) neglecting cost-sensitive scenarios, where type 1 and type 2 errors have different costs; ii) requiring concurrent human predictions for every instance of the training dataset and iii) not dealing with human work capacity constraints. To address these issues, we propose the deferral under cost and capacity constraints framework (DeCCaF). DeCCaF is a novel L2D approach, employing supervised learning to model the probability of human error under less restrictive data requirements (only one expert prediction per instance) and using constraint programming to globally minimize the error cost subject to workload limitations. We test DeCCaF in a series of cost-sensitive fraud detection scenarios with different teams of 9 synthetic fraud analysts, with individual work capacity constraints. The results demonstrate that our approach performs significantly better than the baselines in a wide array of scenarios, achieving an average 8.4% reduction in the misclassification cost.
♻ ☆ Mpox-AISM: AI-Mediated Super Monitoring for Mpox and Like-Mpox
The key to preventing the spread of mpox (monkeypox) lies in timely, convenient, and accurate diagnosis for earlier-stage infected individuals. Unfortunately, the resemblances between common skin diseases and mpox and the need for professional diagnosis inevitably deteriorated the diagnosis of earlier-stage patients with Mpox and contributed to its widespread outbreak in crowded areas. Here, we proposed a real-time visualization strategy called "Super Monitoring" using artificial intelligence and Internet technology, thereby performing a low-cost, convenient, timely, and unspecialized diagnosis for earlier-stage mpox. Specifically, such AI-mediated "super monitoring" (Mpox-AISM) invokes a framework assembled by deep learning models, data augmentation, self-supervised learning, and cloud services. Verified by publicly available datasets, the Precision, Recall, Specificity, and F1-score of Mpox-AISM in diagnosing mpox achieved 99.3%, 94.1%, 99.9%, and 96.6%, respectively. Furthermore, Mpox-AISM's overall accuracy reaches 94.51% in diagnosing mpox, six like-mpox skin diseases, and normal skin. We also employed gradient-weighted class activation mapping to explain the decision-making process of Mpox-AISM, thus handily understanding the specific characteristics that may indicate the mpox's onset and improving its reliability. With the help of the Internet and communication terminal, Mpox-AISM can perform a real-time, low-cost, and convenient diagnosis for earlier-stage mpox in various real-world settings, thereby effectively curbing the spread of mpox virus.
♻ ☆ DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding a few thousand parameters for large-scale models in the 100B parameters range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations -- we refer to this operation as Depth-Weighted-Average (DWA). The learned DWA weights exhibit coherent patterns of information flow, revealing the strong and structured reuse of activations from distant layers. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity of much deeper transformer models, and that for the same perplexity, these new models outperform transformer baselines in terms of memory efficiency and inference time.
♻ ☆ The Role of Transparency in Repeated First-Price Auctions with Unknown Valuations STOC 2024
We study the problem of regret minimization for a single bidder in a sequence of first-price auctions where the bidder discovers the item's value only if the auction is won. Our main contribution is a complete characterization, up to logarithmic factors, of the minimax regret in terms of the auction's \emph{transparency}, which controls the amount of information on competing bids disclosed by the auctioneer at the end of each auction. Our results hold under different assumptions (stochastic, adversarial, and their smoothed variants) on the environment generating the bidder's valuations and competing bids. These minimax rates reveal how the interplay between transparency and the nature of the environment affects how fast one can learn to bid optimally in first-price auctions.
comment: Accepted at STOC 2024
♻ ☆ SLIM: Skill Learning with Multiple Critics ICRA 2024
Self-supervised skill learning aims to acquire useful behaviors that leverage the underlying dynamics of the environment. Latent variable models, based on mutual information maximization, have been successful in this task but still struggle in the context of robotic manipulation. As it requires impacting a possibly large set of degrees of freedom composing the environment, mutual information maximization fails alone in producing useful and safe manipulation behaviors. Furthermore, tackling this by augmenting skill discovery rewards with additional rewards through a naive combination might fail to produce desired behaviors. To address this limitation, we introduce SLIM, a multi-critic learning approach for skill discovery with a particular focus on robotic manipulation. Our main insight is that utilizing multiple critics in an actor-critic framework to gracefully combine multiple reward functions leads to a significant improvement in latent-variable skill discovery for robotic manipulation while overcoming possible interference occurring among rewards which hinders convergence to useful skills. Furthermore, in the context of tabletop manipulation, we demonstrate the applicability of our novel skill discovery approach to acquire safe and efficient motor primitives in a hierarchical reinforcement learning fashion and leverage them through planning, significantly surpassing baseline approaches for skill discovery.
comment: Accepted at IEEE ICRA 2024
♻ ☆ From Tempered to Benign Overfitting in ReLU Neural Networks NeurIPS 2023
Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the performance is non-optimal yet also non-trivial, and degrades as a function of the noise level. However, a theoretical justification of this claim for non-linear NNs has been lacking so far. In this work, we provide several results that aim at bridging these complementing views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data, to benign in high dimensions. Thus, we show that the input dimension has a crucial role on the type of overfitting in this setting, which we also validate empirically for intermediate dimensions. Overall, our results shed light on the intricate connections between the dimension, sample size, architecture and training algorithm on the one hand, and the type of resulting overfitting on the other hand.
comment: NeurIPS 2023; fixed bug
♻ ☆ FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer CVPR 2024
The success of a specific neural network architecture is closely tied to the dataset and task it tackles; there is no one-size-fits-all solution. Thus, considerable efforts have been made to quickly and accurately estimate the performances of neural architectures, without full training or evaluation, for given tasks and datasets. Neural architecture encoding has played a crucial role in the estimation, and graphbased methods, which treat an architecture as a graph, have shown prominent performance. For enhanced representation learning of neural architectures, we introduce FlowerFormer, a powerful graph transformer that incorporates the information flows within a neural architecture. FlowerFormer consists of two key components: (a) bidirectional asynchronous message passing, inspired by the flows; (b) global attention built on flow-based masking. Our extensive experiments demonstrate the superiority of FlowerFormer over existing neural encoding methods, and its effectiveness extends beyond computer vision models to include graph neural networks and auto speech recognition models. Our code is available at http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer.
comment: CVPR 2024 Camera-Ready
♻ ☆ Deep Classifier Mimicry without Data Access
Access to pre-trained models has recently emerged as a standard across numerous machine learning domains. Unfortunately, access to the original data the models were trained on may not equally be granted. This makes it tremendously challenging to fine-tune, compress models, adapt continually, or to do any other type of data-driven update. We posit that original data access may however not be required. Specifically, we propose Contrastive Abductive Knowledge Extraction (CAKE), a model-agnostic knowledge distillation procedure that mimics deep classifiers without access to the original data. To this end, CAKE generates pairs of noisy synthetic samples and diffuses them contrastively toward a model's decision boundary. We empirically corroborate CAKE's effectiveness using several benchmark datasets and various architectural choices, paving the way for broad application.
comment: 11 pages main, 4 figures, 2 tables, 4 pages appendix
♻ ☆ Generalized Early Stopping in Evolutionary Direct Policy Search
Lengthy evaluation times are common in many optimization problems such as direct policy search tasks, especially when they involve conducting evaluations in the physical world, e.g. in robotics applications. Often when evaluating solution over a fixed time period it becomes clear that the objective value will not increase with additional computation time (for example when a two wheeled robot continuously spins on the spot). In such cases, it makes sense to stop the evaluation early to save computation time. However, most approaches to stop the evaluation are problem specific and need to be specifically designed for the task at hand. Therefore, we propose an early stopping method for direct policy search. The proposed method only looks at the objective value at each time step and requires no problem specific knowledge. We test the introduced stopping criterion in five direct policy search environments drawn from games, robotics and classic control domains, and show that it can save up to 75% of the computation time. We also compare it with problem specific stopping criteria and show that it performs comparably, while being more generally applicable.
♻ ☆ TensorBank: Tensor Lakehouse for Foundation Model Training
Storing and streaming high dimensional data for foundation model training became a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows to directly address tensors on block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory translating relational queries and requested transformations as an instance. By making use of the HSI, irrelevant blocks can be skipped without reading them as those indices contain statistics on their content at different hierarchical resolution levels. This is an opinionated architecture powered by open standards and making heavy use of open-source technology. Although, hardened for production use using geospatial-temporal data, this architecture generalizes to other use case like computer vision, computational neuroscience, biological sequence analysis and more.
♻ ☆ Differentially Private Linear Bandits with Partial Distributed Feedback
In this paper, we study the problem of global reward maximization with only partial distributed feedback. This problem is motivated by several real-world applications (e.g., cellular network configuration, dynamic pricing, and policy selection) where an action taken by a central entity influences a large population that contributes to the global reward. However, collecting such reward feedback from the entire population not only incurs a prohibitively high cost but often leads to privacy concerns. To tackle this problem, we consider differentially private distributed linear bandits, where only a subset of users from the population are selected (called clients) to participate in the learning process and the central server learns the global model from such partial feedback by iteratively aggregating these clients' local feedback in a differentially private fashion. We then propose a unified algorithmic learning framework, called differentially private distributed phased elimination (DP-DPE), which can be naturally integrated with popular differential privacy (DP) models (including central DP, local DP, and shuffle DP). Furthermore, we prove that DP-DPE achieves both sublinear regret and sublinear communication cost. Interestingly, DP-DPE also achieves privacy protection ``for free'' in the sense that the additional cost due to privacy guarantees is a lower-order additive term. In addition, as a by-product of our techniques, the same results of ``free" privacy can also be achieved for the standard differentially private linear bandits. Finally, we conduct simulations to corroborate our theoretical results and demonstrate the effectiveness of DP-DPE.
comment: 69 pages, this version is an extension from the preliminary one presented at IEEE/IFIP WiOpt 2022 and was accepted to IEEE Transactions on Network Science and Engineering (TNSE)
♻ ☆ Weighted least-squares approximation with determinantal point processes and generalized volume sampling
We consider the problem of approximating a function from $L^2$ by an element of a given $m$-dimensional space $V_m$, associated with some feature map $\varphi$, using evaluations of the function at random points $x_1,\dots,x_n$. After recalling some results on optimal weighted least-squares using independent and identically distributed points, we consider weighted least-squares using projection determinantal point processes (DPP) or volume sampling. These distributions introduce dependence between the points that promotes diversity in the selected features $\varphi(x_i)$. We first provide a generalized version of volume-rescaled sampling yielding quasi-optimality results in expectation with a number of samples $n = O(m\log(m))$, that means that the expected $L^2$ error is bounded by a constant times the best approximation error in $L^2$. Also, further assuming that the function is in some normed vector space $H$ continuously embedded in $L^2$, we further prove that the approximation is almost surely bounded by the best approximation error measured in the $H$-norm. This includes the cases of functions from $L^\infty$ or reproducing kernel Hilbert spaces. Finally, we present an alternative strategy consisting in using independent repetitions of projection DPP (or volume sampling), yielding similar error bounds as with i.i.d. or volume sampling, but in practice with a much lower number of samples. Numerical experiments illustrate the performance of the different strategies.
comment: In this second version, conjecture (13) on DPP and (16) on volume sampling have been modified, including a convexity requirement. Proofs of propositions 5.4 and 5.12 have been modified accordingly. Remarks 5.5 and 5.6 have been added to discuss alternatives to conjecture (13) on DPP
♻ ☆ PyVRP: a high-performance VRP solver package
We introduce PyVRP, a Python package that implements hybrid genetic search in a state-of-the-art vehicle routing problem (VRP) solver. The package is designed for the VRP with time windows (VRPTW), but can be easily extended to support other VRP variants. PyVRP combines the flexibility of Python with the performance of C++, by implementing (only) performance critical parts of the algorithm in C++, while being fully customisable at the Python level. PyVRP is a polished implementation of the algorithm that ranked 1st in the 2021 DIMACS VRPTW challenge and, after improvements, ranked 1st on the static variant of the EURO meets NeurIPS 2022 vehicle routing competition. The code follows good software engineering practices, and is well-documented and unit tested. PyVRP is freely available under the liberal MIT license. Through numerical experiments we show that PyVRP achieves state-of-the-art results on the VRPTW and capacitated VRP. We hope that PyVRP enables researchers and practitioners to easily and quickly build on a state-of-the-art VRP solver.
comment: Pre-print of accepted paper in INFORMS Journal on Computing. 24 pages, 1 figure, 2 listings
♻ ☆ LMM-Assisted Breast Cancer Treatment Target Segmentation with Consistency Embedding
Recent advancements in Artificial Intelligence (AI) have profoundly influenced medical fields, by providing tools to reduce clinical workloads. However, most AI models are constrained to execute unimodal tasks, in stark contrast to the comprehensive approaches utilized by medical professionals. To address this, here we present RO-LMM, a multi-purpose large multimodal model (LMM) tailored for the field of radiation oncology. This model covers series of tasks within clinical workflow, adept at clinical report summarization, radiation treatment plan suggestion, and plan-guided target volume segmentation. In particular, to perform consecutive clinical tasks, we further present a novel Consistency Embedding Fine-Tuning (CEFTune) technique, which boosts LMM's robustness to noisy inputs while preserving the capability of handling clean inputs, and transform this concept into LMM-driven segmentation framework as Consistency Embedding Segmentation~(CESEG). Experimental results on multi-centre cohorts demonstrate our RO-LMM's promising performance for multiple clinical tasks with generalization capabilities.
comment: 30 pages, 16 table, 5 figures
♻ ☆ ED-NeRF: Efficient Text-Guided Editing of 3D Scene with Latent Space NeRF ICLR 2024
Recently, there has been a significant advancement in text-to-image diffusion models, leading to groundbreaking performance in 2D image generation. These advancements have been extended to 3D models, enabling the generation of novel 3D objects from textual descriptions. This has evolved into NeRF editing methods, which allow the manipulation of existing 3D objects through textual conditioning. However, existing NeRF editing techniques have faced limitations in their performance due to slow training speeds and the use of loss functions that do not adequately consider editing. To address this, here we present a novel 3D NeRF editing approach dubbed ED-NeRF by successfully embedding real-world scenes into the latent space of the latent diffusion model (LDM) through a unique refinement layer. This approach enables us to obtain a NeRF backbone that is not only faster but also more amenable to editing compared to traditional image space NeRF editing. Furthermore, we propose an improved loss function tailored for editing by migrating the delta denoising score (DDS) distillation loss, originally used in 2D image editing to the three-dimensional domain. This novel loss function surpasses the well-known score distillation sampling (SDS) loss in terms of suitability for editing purposes. Our experimental results demonstrate that ED-NeRF achieves faster editing speed while producing improved output quality compared to state-of-the-art 3D editing models.
comment: ICLR 2024; Project Page: https://jhq1234.github.io/ed-nerf.github.io/
♻ ☆ QH9: A Quantum Hamiltonian Prediction Benchmark for QM9 Molecules NeurIPS 2023
Supervised machine learning approaches have been increasingly used in accelerating electronic structure prediction as surrogates of first-principle computational methods, such as density functional theory (DFT). While numerous quantum chemistry datasets focus on chemical properties and atomic forces, the ability to achieve accurate and efficient prediction of the Hamiltonian matrix is highly desired, as it is the most important and fundamental physical quantity that determines the quantum states of physical systems and chemical properties. In this work, we generate a new Quantum Hamiltonian dataset, named as QH9, to provide precise Hamiltonian matrices for 999 or 2998 molecular dynamics trajectories and 130,831 stable molecular geometries, based on the QM9 dataset. By designing benchmark tasks with various molecules, we show that current machine learning models have the capacity to predict Hamiltonian matrices for arbitrary molecules. Both the QH9 dataset and the baseline models are provided to the community through an open-source benchmark, which can be highly valuable for developing machine learning methods and accelerating molecular and materials design for scientific and technological applications. Our benchmark is publicly available at https://github.com/divelab/AIRS/tree/main/OpenDFT/QHBench.
comment: Accepted by NeurIPS 2023, Track on Datasets and Benchmarks
♻ ☆ Task Graph offloading via Deep Reinforcement Learning in Mobile Edge Computing
Various mobile applications that comprise dependent tasks are gaining widespread popularity and are increasingly complex. These applications often have low-latency requirements, resulting in a significant surge in demand for computing resources. With the emergence of mobile edge computing (MEC), it becomes the most significant issue to offload the application tasks onto small-scale devices deployed at the edge of the mobile network for obtaining a high-quality user experience. However, since the environment of MEC is dynamic, most existing works focusing on task graph offloading, which rely heavily on expert knowledge or accurate analytical models, fail to fully adapt to such environmental changes, resulting in the reduction of user experience. This paper investigates the task graph offloading in MEC, considering the time-varying computation capabilities of edge computing devices. To adapt to environmental changes, we model the task graph scheduling for computation offloading as a Markov Decision Process (MDP). Then, we design a deep reinforcement learning algorithm (SATA-DRL) to learn the task scheduling strategy from the interaction with the environment, to improve user experience. Extensive simulations validate that SATA-DRL is superior to existing strategies in terms of reducing average makespan and deadline violation.
comment: 13 figures
♻ ☆ PGCN: Progressive Graph Convolutional Networks for Spatial-Temporal Traffic Forecasting
The complex spatial-temporal correlations in transportation networks make the traffic forecasting problem challenging. Since transportation system inherently possesses graph structures, many research efforts have been put with graph neural networks. Recently, constructing adaptive graphs to the data has shown promising results over the models relying on a single static graph structure. However, the graph adaptations are applied during the training phases and do not reflect the data used during the testing phases. Such shortcomings can be problematic especially in traffic forecasting since the traffic data often suffer from unexpected changes and irregularities in the time series. In this study, we propose a novel traffic forecasting framework called Progressive Graph Convolutional Network (PGCN). PGCN constructs a set of graphs by progressively adapting to online input data during the training and testing phases. Specifically, we implemented the model to construct progressive adjacency matrices by learning trend similarities among graph nodes. Then, the model is combined with the dilated causal convolution and gated activation unit to extract temporal features. With residual and skip connections, PGCN performs the traffic prediction. When applied to seven real-world traffic datasets of diverse geometric nature, the proposed model achieves state-of-the-art performance with consistency in all datasets. We conclude that the ability of PGCN to progressively adapt to input data enables the model to generalize in different study sites with robustness.
comment: 12 pages, 6 figures
♻ ☆ Distilling and Retrieving Generalizable Knowledge for Robot Manipulation via Language Corrections
Today's robot policies exhibit subpar performance when faced with the challenge of generalizing to novel environments. Human corrective feedback is a crucial form of guidance to enable such generalization. However, adapting to and learning from online human corrections is a non-trivial endeavor: not only do robots need to remember human feedback over time to retrieve the right information in new settings and reduce the intervention rate, but also they would need to be able to respond to feedback that can be arbitrary corrections about high-level human preferences to low-level adjustments to skill parameters. In this work, we present Distillation and Retrieval of Online Corrections (DROC), a large language model (LLM)-based system that can respond to arbitrary forms of language feedback, distill generalizable knowledge from corrections, and retrieve relevant past experiences based on textual and visual similarity for improving performance in novel settings. DROC is able to respond to a sequence of online language corrections that address failures in both high-level task plans and low-level skill primitives. We demonstrate that DROC effectively distills the relevant information from the sequence of online corrections in a knowledge base and retrieves that knowledge in settings with new task or object instances. DROC outperforms other techniques that directly generate robot code via LLMs by using only half of the total number of corrections needed in the first round and requires little to no corrections after two iterations. We show further results, videos, prompts and code on https://sites.google.com/stanford.edu/droc .
comment: 8 pages, 4 figures, videos and code links on website https://sites.google.com/stanford.edu/droc
♻ ☆ Time-Synchronized Full System State Estimation Considering Practical Implementation Challenges
As the phasor measurement unit (PMU) placement problem involves a cost-benefit trade-off, more PMUs get placed on the higher voltage buses. However, this causes many of the lower voltage levels of the bulk power system to not be observed by PMUs. This lack of visibility then makes time-synchronized state estimation of the full system a challenging problem. We propose a Deep Neural network-based State Estimator (DeNSE) to overcome this problem. The DeNSE employs a Bayesian framework to indirectly combine inferences drawn from slow timescale but widespread supervisory control and data acquisition (SCADA) data with fast timescale but select PMU data to attain sub-second situational awareness of the entire system. The practical utility of the proposed approach is demonstrated by considering topology changes, non-Gaussian measurement noise, and bad data detection and correction. The results obtained using the IEEE 118-bus system show the superiority of the DeNSE over a purely SCADA state estimator and a PMU-only linear state estimator from a techno-economic viability perspective. Lastly, scalability of the DeNSE is proven by estimating the states of a large and realistic 2000-bus Synthetic Texas system.
♻ ☆ TiC-CLIP: Continual Training of CLIP Models ICLR 2024
Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models. We show OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021-2022 compared with more recently trained models in OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ when compared to the standard practice of retraining from scratch. Code is available at https://github.com/apple/ml-tic-clip.
comment: ICLR 2024
♻ ☆ A Physics Enhanced Residual Learning (PERL) Framework for Vehicle Trajectory Prediction
In vehicle trajectory prediction, physics models and data-driven models are two predominant methodologies. However, each approach presents its own set of challenges: physics models fall short in predictability, while data-driven models lack interpretability. Addressing these identified shortcomings, this paper proposes a novel framework, the Physics-Enhanced Residual Learning (PERL) model. PERL integrates the strengths of physics-based and data-driven methods for traffic state prediction. PERL contains a physics model and a residual learning model. Its prediction is the sum of the physics model result and a predicted residual as a correction to it. It preserves the interpretability inherent to physics-based models and has reduced data requirements compared to data-driven methods. Experiments were conducted using a real-world vehicle trajectory dataset. We proposed a PERL model, with the Intelligent Driver Model (IDM) as its physics car-following model and Long Short-Term Memory (LSTM) as its residual learning model. We compare this PERL model with the physics car-following model, data-driven model, and other physics-informed neural network (PINN) models. The result reveals that PERL achieves better prediction with a small dataset, compared to the physics model, data-driven model, and PINN model. Second, the PERL model showed faster convergence during training, offering comparable performance with fewer training samples than the data-driven model and PINN model. Sensitivity analysis also proves comparable performance of PERL using another residual learning model and a physics car-following model.
♻ ☆ Hyper-parameter Tuning for Fair Classification without Sensitive Attribute Access
Fair machine learning methods seek to train models that balance model performance across demographic subgroups defined over sensitive attributes like race and gender. Although sensitive attributes are typically assumed to be known during training, they may not be available in practice due to privacy and other logistical concerns. Recent work has sought to train fair models without sensitive attributes on training data. However, these methods need extensive hyper-parameter tuning to achieve good results, and hence assume that sensitive attributes are known on validation data. However, this assumption too might not be practical. Here, we propose Antigone, a framework to train fair classifiers without access to sensitive attributes on either training or validation data. Instead, we generate pseudo sensitive attributes on the validation data by training a biased classifier and using the classifier's incorrectly (correctly) labeled examples as proxies for minority (majority) groups. Since fairness metrics like demographic parity, equal opportunity and subgroup accuracy can be estimated to within a proportionality constant even with noisy sensitive attribute information, we show theoretically and empirically that these proxy labels can be used to maximize fairness under average accuracy constraints. Key to our results is a principled approach to select the hyper-parameters of the biased classifier in a completely unsupervised fashion (meaning without access to ground truth sensitive attributes) that minimizes the gap between fairness estimated using noisy versus ground-truth sensitive labels.
♻ ☆ Weighted Ensemble Models Are Strong Continual Learners
In this work, we study the problem of continual learning (CL) where the goal is to learn a model on a sequence of tasks, such that the data from the previous tasks becomes unavailable while learning on the current task data. CL is essentially a balancing act between being able to learn on the new task (i.e., plasticity) and maintaining the performance on the previously learned concepts (i.e., stability). Intending to address the stability-plasticity trade-off, we propose to perform weight-ensembling of the model parameters of the previous and current tasks. This weighted-ensembled model, which we call Continual Model Averaging (or CoMA), attains high accuracy on the current task by leveraging plasticity, while not deviating too far from the previous weight configuration, ensuring stability. We also propose an improved variant of CoMA, named Continual Fisher-weighted Model Averaging (or CoFiMA), that selectively weighs each parameter in the weights ensemble by leveraging the Fisher information of the weights of the model. Both variants are conceptually simple, easy to implement, and effective in attaining state-of-the-art performance on several standard CL benchmarks. Code is available at: https://github.com/IemProg/CoFiMA.
comment: Code: https://github.com/IemProg/CoFiMA
♻ ☆ Enhancing Multimodal Cooperation via Fine-grained Modality Valuation CVPR 2024
One primary topic of multimodal learning is to jointly incorporate heterogeneous information from different modalities. However, most models often suffer from unsatisfactory multimodal cooperation, which cannot jointly utilize all modalities well. Some methods are proposed to identify and enhance the worse learnt modality, but they are often hard to provide the fine-grained observation of multimodal cooperation at sample-level with theoretical support. Hence, it is essential to reasonably observe and improve the fine-grained cooperation between modalities, especially when facing realistic scenarios where the modality discrepancy could vary across different samples. To this end, we introduce a sample-level modality valuation metric to evaluate the contribution of each modality for each sample. Via modality valuation, we observe that modality discrepancy indeed could be different at sample-level, beyond the global contribution discrepancy at dataset-level. We further analyze this issue and improve cooperation between modalities at sample-level by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our methods reasonably observe the fine-grained uni-modal contribution and achieve considerable improvement. The source code and dataset are available at \url{https://github.com/GeWu-Lab/Valuate-and-Enhance-Multimodal-Cooperation}.
comment: Accepted by CVPR 2024
♻ ☆ Arcee's MergeKit: A Toolkit for Merging Large Language Models
The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, has resulted in the development of vast amounts of task-specific models, typically specialized in individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI - including the difficulties of catastrophic forgetting and multitask learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the worlds most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.
comment: 11 pages, 4 figures
♻ ☆ $\mathbb{Z}_2\times \mathbb{Z}_2$ Equivariant Quantum Neural Networks: Benchmarking against Classical Neural Networks
This paper presents a comprehensive comparative analysis of the performance of Equivariant Quantum Neural Networks (EQNN) and Quantum Neural Networks (QNN), juxtaposed against their classical counterparts: Equivariant Neural Networks (ENN) and Deep Neural Networks (DNN). We evaluate the performance of each network with two toy examples for a binary classification task, focusing on model complexity (measured by the number of parameters) and the size of the training data set. Our results show that the $\mathbb{Z}_2\times \mathbb{Z}_2$ EQNN and the QNN provide superior performance for smaller parameter sets and modest training data samples.
comment: 13 pages, 7 figures
♻ ☆ A Survey on Uncertainty Quantification for Deep Learning: An Uncertainty Source Perspective
Deep neural networks (DNNs) have achieved tremendous success in making accurate predictions for computer vision, natural language processing, as well as science and engineering domains. However, it is also well-recognized that DNNs sometimes make unexpected, incorrect, but overconfident predictions. This can cause serious consequences in high-stake applications, such as autonomous driving, medical diagnosis, and disaster response. Uncertainty quantification (UQ) aims to estimate the confidence of DNN predictions beyond prediction accuracy. In recent years, many UQ methods have been developed for DNNs. It is of great practical value to systematically categorize these UQ methods and compare their advantages and disadvantages. However, existing surveys mostly focus on categorizing UQ methodologies from a neural network architecture perspective or a Bayesian perspective and ignore the source of uncertainty that each methodology can incorporate, making it difficult to select an appropriate UQ method in practice. To fill the gap, this paper presents a systematic taxonomy of UQ methods for DNNs based on the types of uncertainty sources (data uncertainty versus model uncertainty). We summarize the advantages and disadvantages of methods in each category. We show how our taxonomy of UQ methodologies can potentially help guide the choice of UQ method in different machine learning problems (e.g., active learning, robustness, and reinforcement learning). We also identify current research gaps and propose several future research directions.
comment: 39 pages, 14 figures
♻ ☆ From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards
Recent progress in large language models (LLMs) has led to their widespread adoption in various domains. However, these advancements have also introduced additional safety risks and raised concerns regarding their detrimental impact on already marginalized populations. Despite growing mitigation efforts to develop safety safeguards, such as supervised safety-oriented fine-tuning and leveraging safe reinforcement learning from human feedback, multiple concerns regarding the safety and ingrained biases in these models remain. Furthermore, previous work has demonstrated that models optimized for safety often display exaggerated safety behaviors, such as a tendency to refrain from responding to certain requests as a precautionary measure. As such, a clear trade-off between the helpfulness and safety of these models has been documented in the literature. In this paper, we further investigate the effectiveness of safety measures by evaluating models on already mitigated biases. Using the case of Llama 2 as an example, we illustrate how LLMs' safety responses can still encode harmful assumptions. To do so, we create a set of non-toxic prompts, which we then use to evaluate Llama models. Through our new taxonomy of LLMs responses to users, we observe that the safety/helpfulness trade-offs are more pronounced for certain demographic groups which can lead to quality-of-service harms for marginalized populations.
comment: 9 pages, 4 figures
♻ ☆ Learning to Make Adherence-Aware Advice
As artificial intelligence (AI) systems play an increasingly prominent role in human decision-making, challenges surface in the realm of human-AI interactions. One challenge arises from the suboptimal AI policies due to the inadequate consideration of humans disregarding AI recommendations, as well as the need for AI to provide advice selectively when it is most pertinent. This paper presents a sequential decision-making model that (i) takes into account the human's adherence level (the probability that the human follows/rejects machine advice) and (ii) incorporates a defer option so that the machine can temporarily refrain from making advice. We provide learning algorithms that learn the optimal advice policy and make advice only at critical time stamps. Compared to problem-agnostic reinforcement learning algorithms, our specialized learning algorithms not only enjoy better theoretical convergence properties but also show strong empirical performance.
♻ ☆ Dodging DeepFake Detection via Implicit Spatial-Domain Notch Filtering
The current high-fidelity generation and high-precision detection of DeepFake images are at an arms race. We believe that producing DeepFakes that are highly realistic and 'detection evasive' can serve the ultimate goal of improving future generation DeepFake detection capabilities. In this paper, we propose a simple yet powerful pipeline to reduce the artifact patterns of fake images without hurting image quality by performing implicit spatial-domain notch filtering. We first demonstrate that frequency-domain notch filtering, although famously shown to be effective in removing periodic noise in the spatial domain, is infeasible for our task at hand due to the manual designs required for the notch filters. We, therefore, resort to a learning-based approach to reproduce the notch filtering effects, but solely in the spatial domain. We adopt a combination of adding overwhelming spatial noise for breaking the periodic noise pattern and deep image filtering to reconstruct the noise-free fake images, and we name our method DeepNotch. Deep image filtering provides a specialized filter for each pixel in the noisy image, producing filtered images with high fidelity compared to their DeepFake counterparts. Moreover, we also use the semantic information of the image to generate an adversarial guidance map to add noise intelligently. Our large-scale evaluation on 3 representative state-of-the-art DeepFake detection methods (tested on 16 types of DeepFakes) has demonstrated that our technique significantly reduces the accuracy of these 3 fake image detection methods, 36.79% on average and up to 97.02% in the best case.
comment: 14 pages
♻ ☆ Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles
Counterfactual explanations describe how to modify a feature vector in order to flip the outcome of a trained classifier. Obtaining robust counterfactual explanations is essential to provide valid algorithmic recourse and meaningful explanations. We study the robustness of explanations of randomized ensembles, which are always subject to algorithmic uncertainty even when the training data is fixed. We formalize the generation of robust counterfactual explanations as a probabilistic problem and show the link between the robustness of ensemble models and the robustness of base learners. We develop a practical method with good empirical performance and support it with theoretical guarantees for ensembles of convex base learners. Our results show that existing methods give surprisingly low robustness: the validity of naive counterfactuals is below $50\%$ on most data sets and can fall to $20\%$ on problems with many features. In contrast, our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
♻ ☆ On the consistency of supervised learning with missing values
In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely-used method of imputing with a constant, such as the mean prior to learning is consistent when missing values are not informative. This contrasts with inferential settings where mean imputation is pointed at for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through multiple imputation. Finally, to compare imputation with learning directly with a model that accounts for missing values, we analyze further decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing theoretically and empirically different missing values strategies in trees, we recommend using the "missing incorporated in attribute" method as it can handle both non-informative and informative missing values.
♻ ☆ Class-Prototype Conditional Diffusion Model with Gradient Projection for Continual Learning
Mitigating catastrophic forgetting is a key hurdle in continual learning. Deep Generative Replay (GR) provides techniques focused on generating samples from prior tasks to enhance the model's memory capabilities using generative AI models ranging from Generative Adversarial Networks (GANs) to the more recent Diffusion Models (DMs). A major issue is the deterioration in the quality of generated data compared to the original, as the generator continuously self-learns from its outputs. This degradation can lead to the potential risk of catastrophic forgetting (CF) occurring in the classifier. To address this, we propose the Gradient Projection Class-Prototype Conditional Diffusion Model (GPPDM), a GR-based approach for continual learning that enhances image quality in generators and thus reduces the CF in classifiers. The cornerstone of GPPDM is a learnable class prototype that captures the core characteristics of images in a given class. This prototype, integrated into the diffusion model's denoising process, ensures the generation of high-quality images of the old tasks, hence reducing the risk of CF in classifiers. Moreover, to further mitigate the CF of diffusion models, we propose a gradient projection technique tailored for the cross-attention layer of diffusion models to maximally maintain and preserve the representations of old task data in the current task as close as possible to their representations when they first arrived. Our empirical studies on diverse datasets demonstrate that our proposed method significantly outperforms existing state-of-the-art models, highlighting its satisfactory ability to preserve image quality and enhance the model's memory retention.
♻ ☆ Noisy Interpolation Learning with Shallow Univariate ReLU Networks ICLR 2024
Understanding how overparameterized neural networks generalize despite perfect interpolation of noisy training data is a fundamental question. Mallinar et. al. 2022 noted that neural networks seem to often exhibit ``tempered overfitting'', wherein the population risk does not converge to the Bayes optimal error, but neither does it approach infinity, yielding non-trivial generalization. However, this has not been studied rigorously. We provide the first rigorous analysis of the overfitting behavior of regression with minimum norm ($\ell_2$ of weights), focusing on univariate two-layer ReLU networks. We show overfitting is tempered (with high probability) when measured with respect to the $L_1$ loss, but also show that the situation is more complex than suggested by Mallinar et. al., and overfitting is catastrophic with respect to the $L_2$ loss, or when taking an expectation over the training set.
comment: To appear at ICLR 2024. Updated version with minor changes in the presentation
♻ ☆ Stabilizing reinforcement learning control: A modular framework for optimizing over all stable behavior
We propose a framework for the design of feedback controllers that combines the optimization-driven and model-free advantages of deep reinforcement learning with the stability guarantees provided by using the Youla-Kucera parameterization to define the search domain. Recent advances in behavioral systems allow us to construct a data-driven internal model; this enables an alternative realization of the Youla-Kucera parameterization based entirely on input-output exploration data. Perhaps of independent interest, we formulate and analyze the stability of such data-driven models in the presence of noise. The Youla-Kucera approach requires a stable "parameter" for controller design. For the training of reinforcement learning agents, the set of all stable linear operators is given explicitly through a matrix factorization approach. Moreover, a nonlinear extension is given using a neural network to express a parameterized set of stable operators, which enables seamless integration with standard deep learning libraries. Finally, we show how these ideas can also be applied to tune fixed-structure controllers.
comment: Postprint; 31 pages. arXiv admin note: text overlap with arXiv:2304.03422
♻ ☆ Short-Form Videos and Mental Health: A Knowledge-Guided Neural Topic Model
While short-form videos head to reshape the entire social media landscape, experts are exceedingly worried about their depressive impacts on viewers, as evidenced by medical studies. To prevent widespread consequences, platforms are eager to predict these videos' impact on viewers' mental health. Subsequently, they can take intervention measures, such as revising recommendation algorithms and displaying viewer discretion. Nevertheless, applicable predictive methods lack relevance to well-established medical knowledge, which outlines clinically proven external and environmental factors of depression. To account for such medical knowledge, we resort to an emergent methodological discipline, seeded Neural Topic Models (NTMs). However, existing seeded NTMs suffer from the limitations of single-origin topics, unknown topic sources, unclear seed supervision, and suboptimal convergence. To address those challenges, we develop a novel Knowledge-guided Multimodal NTM to predict a short-form video's depressive impact on viewers. Extensive empirical analyses using TikTok and Douyin datasets prove that our method outperforms state-of-the-art benchmarks. Our method also discovers medically relevant topics from videos that are linked to depressive impact. We contribute to IS with a novel video analytics method that is generalizable to other video classification problems. Practically, our method can help platforms understand videos' mental impacts, thus adjusting recommendations and video topic disclosure.
♻ ☆ Gaussian Cooling and Dikin Walks: The Interior-Point Method for Logconcave Sampling
The connections between (convex) optimization and (logconcave) sampling have been considerably enriched in the past decade with many conceptual and mathematical analogies. For instance, the Langevin algorithm can be viewed as a sampling analogue of gradient descent and has condition-number-dependent guarantees on its performance. In the early 1990s, Nesterov and Nemirovski developed the Interior-Point Method (IPM) for convex optimization based on self-concordant barriers, providing efficient algorithms for structured convex optimization, often faster than the general method. This raises the following question: can we develop an analogous IPM for structured sampling problems? In 2012, Kannan and Narayanan proposed the Dikin walk for uniformly sampling polytopes, and an improved analysis was given in 2020 by Laddha-Lee-Vempala. The Dikin walk uses a local metric defined by a self-concordant barrier for linear constraints. Here we generalize this approach by developing and adapting IPM machinery together with the Dikin walk for poly-time sampling algorithms. Our IPM-based sampling framework provides an efficient warm start and goes beyond uniform distributions and linear constraints. We illustrate the approach on important special cases, in particular giving the fastest algorithms to sample uniform, exponential, or Gaussian distributions on a truncated PSD cone. The framework is general and can be applied to other sampling algorithms.
comment: Improved writing with minor errors fixed
Cryptography and Security
☆ Global, robust and comparable digital carbon assets
Carbon credits purchased in the voluntary carbon market allow unavoidable emissions, such as from international flights for essential travel, to be offset by an equivalent climate benefit, such as avoiding emissions from tropical deforestation. However, many concerns regarding the credibility of these offsetting claims have been raised. Moreover, the credit market is manual, therefore inefficient and unscalable, and non-fungible, therefore illiquid. To address these issues, we propose an efficient digital methodology that combines remote sensing data, modern econometric techniques, and on-chain certification and trading to create a new digital carbon asset (the PACT stablecoin) against which carbon offsetting claims can be transparently verified. PACT stablecoins are produced as outputs from a reproducible computational pipeline for estimating the climate benefits of carbon offset projects that not only quantifies the CO2 emissions involved, but also allows for similar credits to be pooled based on their co-benefits such as biodiversity and jurisdictional attributes, increasing liquidity through fungibility within pools. We implement and evaluate the PACT carbon stablecoin on the Tezos blockchain, which is designed to facilitate low-cost transactions while minimizing environmental impact. Our implementation includes a contract for a registry for tracking issuance, ownership, and retirement of credits, and a custodian contract to bridge on-chain and off-chain transactions. Our work brings scale and trust to the voluntary carbon market by providing a transparent, scalable, and efficient framework for high integrity carbon credit transactions.
comment: 10 pages. Extended version, March 2024. A shortened version is to be published at the 6th IEEE International Conference on Blockchain and Cryptocurrency (ICBC 2024)
☆ Maximal $α$-Leakage for Quantum Privacy Mechanisms
In this work, maximal $\alpha$-leakage is introduced to quantify how much a quantum adversary can learn about any sensitive information of data upon observing its disturbed version via a quantum privacy mechanism. We first show that an adversary's maximal expected $\alpha$-gain using optimal measurement is characterized by measured conditional R\'enyi entropy. This can be viewed as a parametric generalization of K\"onig et al.'s famous guessing probability formula [IEEE Trans. Inf. Theory, 55(9), 2009]. Then, we prove that the $\alpha$-leakage and maximal $\alpha$-leakage for a quantum privacy mechanism are determined by measured Arimoto information and measured R\'enyi capacity, respectively. Various properties of maximal $\alpha$-leakage, such as data processing inequality and composition property are established as well. Moreover, we show that regularized $\alpha$-leakage and regularized maximal $\alpha$-leakage for identical and independent quantum privacy mechanisms coincide with $\alpha$-tilted sandwiched R\'enyi information and sandwiched R\'enyi capacity, respectively.
☆ FHAUC: Privacy Preserving AUC Calculation for Federated Learning using Fully Homomorphic Encryption
Ensuring data privacy is a significant challenge for machine learning applications, not only during model training but also during evaluation. Federated learning has gained significant research interest in recent years as a result. Current research on federated learning primarily focuses on preserving privacy during the training phase. However, model evaluation has not been adequately addressed, despite the potential for significant privacy leaks during this phase as well. In this paper, we demonstrate that the state-of-the-art AUC computation method for federated learning systems, which utilizes differential privacy, still leaks sensitive information about the test data while also requiring a trusted central entity to perform the computations. More importantly, we show that the performance of this method becomes completely unusable as the data size decreases. In this context, we propose an efficient, accurate, robust, and more secure evaluation algorithm capable of computing the AUC in horizontal federated learning systems. Our approach not only enhances security compared to the current state-of-the-art but also surpasses the state-of-the-art AUC computation method in both approximation performance and computational robustness, as demonstrated by experimental results. To illustrate, our approach can efficiently calculate the AUC of a federated learning system involving 100 parties, achieving 99.93% accuracy in just 0.68 seconds, regardless of data size, while providing complete data privacy.
☆ Adversary-Augmented Simulation to evaluate client-fairness on HyperLedger Fabric
This paper presents a novel adversary model specifically tailored to distributed systems, with the aim to asses the security of blockchain technologies. Building upon literature on adversarial assumptions and capabilities, we include classical notions of failure and communication models to classify and bind the use of adversarial actions. We focus on the effect of these actions on properties of distributed protocols. A significant effort of our research is the integration of this model into the Multi-Agent eXperimenter (MAX) framework. This integration enables realistic simulations of adversarial attacks on blockchain systems. In particular, we have simulated attacks violating a form of client-fairness on HyperLedger Fabric.
comment: 10 pages (2 pages of references), 8 figures
☆ A Differentially Private Clustering Algorithm for Well-Clustered Graphs
We study differentially private (DP) algorithms for recovering clusters in well-clustered graphs, which are graphs whose vertex set can be partitioned into a small number of sets, each inducing a subgraph of high inner conductance and small outer conductance. Such graphs have widespread application as a benchmark in the theoretical analysis of spectral clustering. We provide an efficient ($\epsilon$,$\delta$)-DP algorithm tailored specifically for such graphs. Our algorithm draws inspiration from the recent work of Chen et al., who developed DP algorithms for recovery of stochastic block models in cases where the graph comprises exactly two nearly-balanced clusters. Our algorithm works for well-clustered graphs with $k$ nearly-balanced clusters, and the misclassification ratio almost matches the one of the best-known non-private algorithms. We conduct experimental evaluations on datasets with known ground truth clusters to substantiate the prowess of our algorithm. We also show that any (pure) $\epsilon$-DP algorithm would result in substantial error.
☆ Large Language Models for Blockchain Security: A Systematic Literature Review
Large Language Models (LLMs) have emerged as powerful tools in various domains involving blockchain security (BS). Several recent studies are exploring LLMs applied to BS. However, there remains a gap in our understanding regarding the full scope of applications, impacts, and potential constraints of LLMs on blockchain security. To fill this gap, we conduct a literature review on LLM4BS. As the first review of LLM's application on blockchain security, our study aims to comprehensively analyze existing research and elucidate how LLMs contribute to enhancing the security of blockchain systems. Through a thorough examination of scholarly works, we delve into the integration of LLMs into various aspects of blockchain security. We explore the mechanisms through which LLMs can bolster blockchain security, including their applications in smart contract auditing, identity verification, anomaly detection, vulnerable repair, and so on. Furthermore, we critically assess the challenges and limitations associated with leveraging LLMs for blockchain security, considering factors such as scalability, privacy concerns, and adversarial attacks. Our review sheds light on the opportunities and potential risks inherent in this convergence, providing valuable insights for researchers, practitioners, and policymakers alike.
☆ Safeguarding Medical Image Segmentation Datasets against Unauthorized Training via Contour- and Texture-Aware Perturbations
The widespread availability of publicly accessible medical images has significantly propelled advancements in various research and clinical fields. Nonetheless, concerns regarding unauthorized training of AI systems for commercial purposes and the duties of patient privacy protection have led numerous institutions to hesitate to share their images. This is particularly true for medical image segmentation (MIS) datasets, where the processes of collection and fine-grained annotation are time-intensive and laborious. Recently, Unlearnable Examples (UEs) methods have shown the potential to protect images by adding invisible shortcuts. These shortcuts can prevent unauthorized deep neural networks from generalizing. However, existing UEs are designed for natural image classification and fail to protect MIS datasets imperceptibly as their protective perturbations are less learnable than important prior knowledge in MIS, e.g., contour and texture features. To this end, we propose an Unlearnable Medical image generation method, termed UMed. UMed integrates the prior knowledge of MIS by injecting contour- and texture-aware perturbations to protect images. Given that our target is to only poison features critical to MIS, UMed requires only minimal perturbations within the ROI and its contour to achieve greater imperceptibility (average PSNR is 50.03) and protective performance (clean average DSC degrades from 82.18% to 6.80%).
☆ Quantum-activated neural reservoirs on-chip open up large hardware security models for resilient authentication
Quantum artificial intelligence is a frontier of artificial intelligence research, pioneering quantum AI-powered circuits to address problems beyond the reach of deep learning with classical architectures. This work implements a large-scale quantum-activated recurrent neural network possessing more than 3 trillion hardware nodes/cm$^2$, originating from repeatable atomic-scale nucleation dynamics in an amorphous material integrated on-chip, controlled with 0.07 nW electric power per readout channel. Compared to the best-performing reservoirs currently reported, this implementation increases the scale of the network by two orders of magnitude and reduces the power consumption by six, reaching power efficiencies in the range of the human brain, dissipating 0.2 nW/neuron. When interrogated by a classical input, the chip implements a large-scale hardware security model, enabling dictionary-free authentication secure against statistical inference attacks, including AI's present and future development, even for an adversary with a copy of all the classical components available. Experimental tests report 99.6% reliability, 100% user authentication accuracy, and an ideal 50% key uniqueness. Due to its quantum nature, the chip supports a bit density per feature size area three times higher than the best technology available, with the capacity to store more than $2^{1104}$ keys in a footprint of 1 cm$^2$. Such a quantum-powered platform could help counteract the emerging form of warfare led by the cybercrime industry in breaching authentication to target small to large-scale facilities, from private users to intelligent energy grids.
☆ HETAL: Efficient Privacy-preserving Transfer Learning with Homomorphic Encryption ICML 2023
Transfer learning is a de facto standard method for efficiently training machine learning models for data-scarce problems by adding and fine-tuning new classification layers to a model pre-trained on large datasets. Although numerous previous studies proposed to use homomorphic encryption to resolve the data privacy issue in transfer learning in the machine learning as a service setting, most of them only focused on encrypted inference. In this study, we present HETAL, an efficient Homomorphic Encryption based Transfer Learning algorithm, that protects the client's privacy in training tasks by encrypting the client data using the CKKS homomorphic encryption scheme. HETAL is the first practical scheme that strictly provides encrypted training, adopting validation-based early stopping and achieving the accuracy of nonencrypted training. We propose an efficient encrypted matrix multiplication algorithm, which is 1.8 to 323 times faster than prior methods, and a highly precise softmax approximation algorithm with increased coverage. The experimental results for five well-known benchmark datasets show total training times of 567-3442 seconds, which is less than an hour.
comment: ICML 2023, Appendix D includes some updates after official publication
☆ DP-RDM: Adapting Diffusion Models to Private Domains Without Fine-Tuning
Text-to-image diffusion models have been shown to suffer from sample-level memorization, possibly reproducing near-perfect replica of images that they are trained on, which may be undesirable. To remedy this issue, we develop the first differentially private (DP) retrieval-augmented generation algorithm that is capable of generating high-quality image samples while providing provable privacy guarantees. Specifically, we assume access to a text-to-image diffusion model trained on a small amount of public data, and design a DP retrieval mechanism to augment the text prompt with samples retrieved from a private retrieval dataset. Our \emph{differentially private retrieval-augmented diffusion model} (DP-RDM) requires no fine-tuning on the retrieval dataset to adapt to another domain, and can use state-of-the-art generative models to generate high-quality image samples while satisfying rigorous DP guarantees. For instance, when evaluated on MS-COCO, our DP-RDM can generate samples with a privacy budget of $\epsilon=10$, while providing a $3.5$ point improvement in FID compared to public-only retrieval for up to $10,000$ queries.
☆ Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
DeepFakes, which refer to AI-generated media content, have become an increasing concern due to their use as a means for disinformation. Detecting DeepFakes is currently solved with programmed machine learning algorithms. In this work, we investigate the capabilities of multimodal large language models (LLMs) in DeepFake detection. We conducted qualitative and quantitative experiments to demonstrate multimodal LLMs and show that they can expose AI-generated images through careful experimental design and prompt engineering. This is interesting, considering that LLMs are not inherently tailored for media forensic tasks, and the process does not require programming. We discuss the limitations of multimodal LLMs for these tasks and suggest possible improvements.
☆ Establishing a leader in a pairwise comparisons method
Abstract Like electoral systems, decision-making methods are also vulnerable to manipulation by decision-makers. The ability to effectively defend against such threats can only come from thoroughly understanding the manipulation mechanisms. In the presented article, we show two algorithms that can be used to launch a manipulation attack. They allow for equating the weights of two selected alternatives in the pairwise comparison method and, consequently, choosing a leader. The theoretical considerations are accompanied by a Monte Carlo simulation showing the relationship between the size of the PC matrix, the degree of inconsistency, and the ease of manipulation. This work is a continuation of our previous research published in the paper (Szybowski et al., 2023)
comment: 9 figures, 19 pages
☆ Structuring the Chaos: Enabling Small Business Cyber-Security Risks & Assets Modelling with a UML Class Model
Small businesses are increasingly adopting IT, and consequently becoming more vulnerable to cyber-incidents. Whilst small businesses are aware of the cyber-security risks, many struggle with implementing mitigations. Some of these can be traced to fundamental differences in the characteristics of small business versus large enterprises where modern cyber-security solutions are widely deployed. Small business specific cyber-security tools are needed. Currently available cyber-security tools and standards assume technical expertise and time resources often not practical for small businesses. Cyber-security competes with other roles that small business owners take on, e.g. cleaning, sales etc. A small business model, salient and implementable at-scale, with simplified non-specialist terminologies and presentation is needed to encourage sustained participation of all stakeholders, not just technical ones. We propose a new UML class (Small IT Data (SITD)) model to support the often chaotic information-gathering phase of a small business' first foray into cyber-security. The SITD model is designed in the UML format to help small business implement technical solutions. The SITD model structure stays relevant by using generic classes and structures that evolve with technology and environmental changes. The SITD model keeps security decisions proportionate to the business by highlighting relationships between business strategy tasks and IT infrastructure. We construct a set of design principles to address small business cyber-security needs. Model components are designed in response to these needs. The uses of the SITD model are then demonstrated and design principles validated by examining a case study of a real small business operational and IT information. The SITD model's ability to illustrate breach information is also demonstrated using the NotPetya incident.
☆ Few-Shot Adversarial Prompt Learning on Vision-Language Models
The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention. Inspired by the success of vision-language foundation models, previous efforts achieved zero-shot adversarial robustness by aligning adversarial visual features with text supervision. However, in practice, they are still unsatisfactory due to several issues, including heavy adaptation cost, suboptimal text supervision, and uncontrolled natural generalization capacity. In this paper, to address these issues, we propose a few-shot adversarial prompt framework where adapting input sequences with limited data makes significant adversarial robustness improvement. Specifically, we achieve this by providing adversarially correlated text supervision that is end-to-end learned from adversarial examples. We also propose a novel training objective that enhances the consistency of multi-modal features while encourages differentiated uni-modal features between natural and adversarial examples. The proposed framework gives access to learn adversarial text supervision, which provides superior cross-modal adversarial alignment and matches state-of-the-art zero-shot adversarial robustness with only 1% training data.
comment: 25 pages, 13 tables, 8 figures
☆ Improving Robustness to Model Inversion Attacks via Sparse Coding Architectures
Recent model inversion attack algorithms permit adversaries to reconstruct a neural network's private training data just by repeatedly querying the network and inspecting its outputs. In this work, we develop a novel network architecture that leverages sparse-coding layers to obtain superior robustness to this class of attacks. Three decades of computer science research has studied sparse coding in the context of image denoising, object recognition, and adversarial misclassification settings, but to the best of our knowledge, its connection to state-of-the-art privacy vulnerabilities remains unstudied. However, sparse coding architectures suggest an advantageous means to defend against model inversion attacks because they allow us to control the amount of irrelevant private information encoded in a network's intermediate representations in a manner that can be computed efficiently during training and that is known to have little effect on classification accuracy. Specifically, compared to networks trained with a variety of state-of-the-art defenses, our sparse-coding architectures maintain comparable or higher classification accuracy while degrading state-of-the-art training data reconstructions by factors of 1.1 to 18.3 across a variety of reconstruction quality metrics (PSNR, SSIM, FID). This performance advantage holds across 5 datasets ranging from CelebA faces to medical images and CIFAR-10, and across various state-of-the-art SGD-based and GAN-based inversion attacks, including Plug-&-Play attacks. We provide a cluster-ready PyTorch codebase to promote research and standardize defense evaluations.
comment: 32 pages, 15 Tables, and 9 Figures
☆ Improving Galileo OSNMA Time To First Authenticated Fix
Galileo is the first global navigation satellite system to authenticate their civilian signals through the Open Service Galileo Message Authentication (OSNMA) protocol. However, OSNMA delays the time to obtain a first position and time fix, the so-called Time To First Authentication Fix (TTFAF). Reducing the TTFAF as much as possible is crucial to integrate the technology seamlessly into the current products. In the cases where the receiver already has cryptographic data available, the so-called hot start mode and focus of this article, the currently available implementations achieve an average TTFAF of around 100 seconds in ideal environments. In this work, we dissect the TTFAF process, propose two main optimizations to reduce the TTFAF, and benchmark them in three distinct scenarios (open-sky, soft urban, and hard urban) with recorded real data. Moreover, we evaluate the optimizations using the synthetic scenario from the official OSNMA test vectors. The first block of optimizations centers on extracting as much information as possible from broken sub-frames by processing them at page level and combining redundant data from multiple satellites. The second block of optimizations aims to reconstruct missed navigation data by using fields in the authentication tags belonging to the same sub-frame as the authentication key. Combining both optimizations improves the TTFAF substantially for all considered scenarios. We obtain an average TTFAF of 60.9 and 68.8 seconds for the test vectors and the open-sky scenario, respectively, with a best-case of 44.0 seconds in both. Likewise, the urban scenarios see a drastic reduction of the average TTFAF between the non-optimized and optimized cases, from 127.5 to 87.5 seconds in the soft urban scenario and from 266.1 to 146.1 seconds in the hard urban scenario. These optimizations are available as part of the open-source OSNMAlib library on GitHub.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. 12 pages, 15 figures
☆ Reversible Jump Attack to Textual Classifiers with Modification Reduction
Recent studies on adversarial examples expose vulnerabilities of natural language processing (NLP) models. Existing techniques for generating adversarial examples are typically driven by deterministic hierarchical rules that are agnostic to the optimal adversarial examples, a strategy that often results in adversarial samples with a suboptimal balance between magnitudes of changes and attack successes. To this end, in this research we propose two algorithms, Reversible Jump Attack (RJA) and Metropolis-Hasting Modification Reduction (MMR), to generate highly effective adversarial examples and to improve the imperceptibility of the examples, respectively. RJA utilizes a novel randomization mechanism to enlarge the search space and efficiently adapts to a number of perturbed words for adversarial examples. With these generated adversarial examples, MMR applies the Metropolis-Hasting sampler to enhance the imperceptibility of adversarial examples. Extensive experiments demonstrate that RJA-MMR outperforms current state-of-the-art methods in attack performance, imperceptibility, fluency and grammar correctness.
♻ ☆ Exploring the Market Dynamics of Liquid Staking Derivatives (LSDs)
Staking has emerged as a crucial concept following Ethereum's transition to Proof-of-Stake consensus. The introduction of Liquid Staking Derivatives (LSDs) has effectively addressed the illiquidity issue associated with solo staking, gaining significant market attention. This paper analyzes the LSD market dynamics from the perspectives of both liquidity takers (LTs) and liquidity providers (LPs). We first quantify the price discrepancy between the LSD primary and secondary markets. Then we investigate and empirically measure how LTs can leverage such discrepancy to exploit arbitrage opportunities, unveiling the potential barriers to LSD arbitrages. In addition, we evaluate the financial profit and losses experienced by LPs who supply LSDs for liquidity provision. Our results show that 66% of LSD liquidity positions generate returns lower than those from simply holding the corresponding LSDs.
♻ ☆ TMI! Finetuned Models Leak Private Information from their Pretraining Data
Transfer learning has become an increasingly popular technique in machine learning as a way to leverage a pretrained model trained for one task to assist with building a finetuned model for a related task. This paradigm has been especially popular for $\textit{privacy}$ in machine learning, where the pretrained model is considered public, and only the data for finetuning is considered sensitive. However, there are reasons to believe that the data used for pretraining is still sensitive, making it essential to understand how much information the finetuned model leaks about the pretraining data. In this work we propose a new membership-inference threat model where the adversary only has access to the finetuned model and would like to infer the membership of the pretraining data. To realize this threat model, we implement a novel metaclassifier-based attack, $\textbf{TMI}$, that leverages the influence of memorized pretraining samples on predictions in the downstream task. We evaluate $\textbf{TMI}$ on both vision and natural language tasks across multiple transfer learning settings, including finetuning with differential privacy. Through our evaluation, we find that $\textbf{TMI}$ can successfully infer membership of pretraining examples using query access to the finetuned model. An open-source implementation of $\textbf{TMI}$ can be found $\href{https://github.com/johnmath/tmi-pets24}{\text{on GitHub}}$.
♻ ☆ Closing the Gap: Achieving Better Accuracy-Robustness Tradeoffs against Query-Based Attacks AAAI
Although promising, existing defenses against query-based attacks share a common limitation: they offer increased robustness against attacks at the price of a considerable accuracy drop on clean samples. In this work, we show how to efficiently establish, at test-time, a solid tradeoff between robustness and accuracy when mitigating query-based attacks. Given that these attacks necessarily explore low-confidence regions, our insight is that activating dedicated defenses, such as random noise defense and random image transformations, only for low-confidence inputs is sufficient to prevent them. Our approach is independent of training and supported by theory. We verify the effectiveness of our approach for various existing defenses by conducting extensive experiments on CIFAR-10, CIFAR-100, and ImageNet. Our results confirm that our proposal can indeed enhance these defenses by providing better tradeoffs between robustness and accuracy when compared to state-of-the-art approaches while being completely training-free.
comment: To appear in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2024
♻ ☆ Towards Secure Virtual Elections: Multiparty Computation of Order Based Voting Rules
Electronic voting systems are essential for holding virtual elections, and the need for such systems increases due to the COVID-19 pandemic and the social distancing that it mandates. One of the main challenges in e-voting systems is to secure the voting process: namely, to certify that the computed results are consistent with the cast ballots, and that the privacy of the voters is preserved. We propose herein a secure voting protocol for elections that are governed by order-based voting rules. Our protocol offers perfect ballot secrecy, in the sense that it issues only the required output, while no other information on the cast ballots is revealed. Such perfect secrecy, which is achieved by employing secure multiparty computation tools, may increase the voters' confidence and, consequently, encourage them to vote according to their true preferences. Evaluation of the protocol's computational costs establishes that it is lightweight and can be readily implemented in real-life electronic elections.
♻ ☆ On the Privacy of Selection Mechanisms with Gaussian Noise AISTATS 2024
Report Noisy Max and Above Threshold are two classical differentially private (DP) selection mechanisms. Their output is obtained by adding noise to a sequence of low-sensitivity queries and reporting the identity of the query whose (noisy) answer satisfies a certain condition. Pure DP guarantees for these mechanisms are easy to obtain when Laplace noise is added to the queries. On the other hand, when instantiated using Gaussian noise, standard analyses only yield approximate DP guarantees despite the fact that the outputs of these mechanisms lie in a discrete space. In this work, we revisit the analysis of Report Noisy Max and Above Threshold with Gaussian noise and show that, under the additional assumption that the underlying queries are bounded, it is possible to provide pure ex-ante DP bounds for Report Noisy Max and pure ex-post DP bounds for Above Threshold. The resulting bounds are tight and depend on closed-form expressions that can be numerically evaluated using standard methods. Empirically we find these lead to tighter privacy accounting in the high privacy, low data regime. Further, we propose a simple privacy filter for composing pure ex-post DP guarantees, and use it to derive a fully adaptive Gaussian Sparse Vector Technique mechanism. Finally, we provide experiments on mobility and energy consumption datasets demonstrating that our Sparse Vector Technique is practically competitive with previous approaches and requires less hyper-parameter tuning.
comment: AISTATS 2024
♻ ☆ Adversarial Attacks and Defenses in Automated Control Systems: A Comprehensive Benchmark
Integrating machine learning into Automated Control Systems (ACS) enhances decision-making in industrial process management. One of the limitations to the widespread adoption of these technologies in industry is the vulnerability of neural networks to adversarial attacks. This study explores the threats in deploying deep learning models for fault diagnosis in ACS using the Tennessee Eastman Process dataset. By evaluating three neural networks with different architectures, we subject them to six types of adversarial attacks and explore five different defense methods. Our results highlight the strong vulnerability of models to adversarial samples and the varying effectiveness of defense strategies. We also propose a novel protection approach by combining multiple defense methods and demonstrate it's efficacy. This research contributes several insights into securing machine learning within ACS, ensuring robust fault diagnosis in industrial processes.
♻ ☆ Formalizing Stack Safety as a Security Property
The term stack safety is used to describe a variety of compiler, run-time, and hardware mechanisms for protecting stack memory. Unlike "the heap," the ISA-level stack does not correspond to a single high-level language concept: different compilers use it in different ways to support procedural and functional abstraction mechanisms from a wide range of languages. This protean nature makes it difficult to nail down what it means to correctly enforce stack safety. We propose a new formal characterization of stack safety using concepts from language-based security. Rather than treating stack safety as a monolithic property, we decompose it into an integrity property and a confidentiality property for each of the caller and the callee, plus a control-flow property: five properties in all. This formulation is motivated by a particular class of enforcement mechanisms, the "lazy" stack safety micro-policies studied by Roessler and DeHon, which permit functions to write into one another's frames but taint the changed locations so that the frame's owner cannot access them. No existing characterization of stack safety captures this style of safety; we capture it here by stating our properties in terms of the observable behavior of the system. Our properties go further than previous formal definitions of stack safety, supporting caller- and callee-saved registers, arguments passed on the stack, and tail-call elimination. We validate the properties by using them to distinguish between correct and incorrect implementations of Roessler and DeHon's micro-policies using property-based random testing. Our test harness successfully identifies several broken variants, including Roessler and DeHon's lazy policy; a repaired version of their policy passes our tests.
♻ ☆ Differentially Private Linear Bandits with Partial Distributed Feedback
In this paper, we study the problem of global reward maximization with only partial distributed feedback. This problem is motivated by several real-world applications (e.g., cellular network configuration, dynamic pricing, and policy selection) where an action taken by a central entity influences a large population that contributes to the global reward. However, collecting such reward feedback from the entire population not only incurs a prohibitively high cost but often leads to privacy concerns. To tackle this problem, we consider differentially private distributed linear bandits, where only a subset of users from the population are selected (called clients) to participate in the learning process and the central server learns the global model from such partial feedback by iteratively aggregating these clients' local feedback in a differentially private fashion. We then propose a unified algorithmic learning framework, called differentially private distributed phased elimination (DP-DPE), which can be naturally integrated with popular differential privacy (DP) models (including central DP, local DP, and shuffle DP). Furthermore, we prove that DP-DPE achieves both sublinear regret and sublinear communication cost. Interestingly, DP-DPE also achieves privacy protection ``for free'' in the sense that the additional cost due to privacy guarantees is a lower-order additive term. In addition, as a by-product of our techniques, the same results of ``free" privacy can also be achieved for the standard differentially private linear bandits. Finally, we conduct simulations to corroborate our theoretical results and demonstrate the effectiveness of DP-DPE.
comment: 69 pages, this version is an extension from the preliminary one presented at IEEE/IFIP WiOpt 2022 and was accepted to IEEE Transactions on Network Science and Engineering (TNSE)
♻ ☆ Understanding Information Disclosure from Secure Computation Output: A Study of Average Salary Computation SP
Secure multi-party computation has seen substantial performance improvements in recent years and is being increasingly used in commercial products. While a significant amount of work was dedicated to improving its efficiency under standard security models, the threat models do not account for information leakage from the output of secure function evaluation. Quantifying information disclosure about private inputs from observing the function outcome is the subject of this work. Motivated by the City of Boston gender pay gap studies, in this work we focus on the computation of the average of salaries and quantify information disclosure about private inputs of one or more participants (the target) to an adversary via information-theoretic techniques. We study a number of distributions including log-normal, which is typically used for modeling salaries. We consequently evaluate information disclosure after repeated evaluation of the average function on overlapping inputs, as was done in the Boston gender pay study that ran multiple times, and provide recommendations for using the sum and average functions in secure computation applications. Our goal is to develop mechanisms that lower information disclosure about participants' inputs to a desired level and provide guidelines for setting up real-world secure evaluation of this function.
comment: This is the full version of our conference paper, appearing in the proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy (CODASPY), Porto, Portugal, 2024
♻ ☆ Dodging DeepFake Detection via Implicit Spatial-Domain Notch Filtering
The current high-fidelity generation and high-precision detection of DeepFake images are at an arms race. We believe that producing DeepFakes that are highly realistic and 'detection evasive' can serve the ultimate goal of improving future generation DeepFake detection capabilities. In this paper, we propose a simple yet powerful pipeline to reduce the artifact patterns of fake images without hurting image quality by performing implicit spatial-domain notch filtering. We first demonstrate that frequency-domain notch filtering, although famously shown to be effective in removing periodic noise in the spatial domain, is infeasible for our task at hand due to the manual designs required for the notch filters. We, therefore, resort to a learning-based approach to reproduce the notch filtering effects, but solely in the spatial domain. We adopt a combination of adding overwhelming spatial noise for breaking the periodic noise pattern and deep image filtering to reconstruct the noise-free fake images, and we name our method DeepNotch. Deep image filtering provides a specialized filter for each pixel in the noisy image, producing filtered images with high fidelity compared to their DeepFake counterparts. Moreover, we also use the semantic information of the image to generate an adversarial guidance map to add noise intelligently. Our large-scale evaluation on 3 representative state-of-the-art DeepFake detection methods (tested on 16 types of DeepFakes) has demonstrated that our technique significantly reduces the accuracy of these 3 fake image detection methods, 36.79% on average and up to 97.02% in the best case.
comment: 14 pages
♻ ☆ TTPXHunter: Actionable Threat Intelligence Extraction as TTPs from Finished Cyber Threat Reports
Understanding the modus operandi of adversaries aids organizations in employing efficient defensive strategies and sharing intelligence in the community. This knowledge is often present in unstructured natural language text within threat analysis reports. A translation tool is needed to interpret the modus operandi explained in the sentences of the threat report and translate it into a structured format. This research introduces a methodology named TTPXHunter for the automated extraction of threat intelligence in terms of Tactics, Techniques, and Procedures (TTPs) from finished cyber threat reports. It leverages cyber domain-specific state-of-the-art natural language processing (NLP) to augment sentences for minority class TTPs and refine pinpointing the TTPs in threat analysis reports significantly. The knowledge of threat intelligence in terms of TTPs is essential for comprehensively understanding cyber threats and enhancing detection and mitigation strategies. We create two datasets: an augmented sentence-TTP dataset of 39,296 samples and a 149 real-world cyber threat intelligence report-to-TTP dataset. Further, we evaluate TTPXHunter on the augmented sentence dataset and the cyber threat reports. The TTPXHunter achieves the highest performance of 92.42% f1-score on the augmented dataset, and it also outperforms existing state-of-the-art solutions in TTP extraction by achieving an f1-score of 97.09% when evaluated over the report dataset. TTPXHunter significantly improves cybersecurity threat intelligence by offering quick, actionable insights into attacker behaviors. This advancement automates threat intelligence analysis, providing a crucial tool for cybersecurity professionals fighting cyber threats.
comment: Under Review
Computation and Language
☆ MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io
comment: 46 Pages, Work in Progress, Benchmark Project Page: https://mathverse-cuhk.github.io
☆ ReAct Meets ActRe: Autonomous Annotations of Agent Trajectories for Contrastive Self-Training
Language agents have demonstrated autonomous decision-making abilities by reasoning with foundation models. Recently, efforts have been made to train language agents for performance improvement, with multi-step reasoning and action trajectories as the training data. However, collecting such trajectories still requires considerable human effort, by either artificial annotations or implementations of diverse prompting frameworks. In this work, we propose A$^3$T, a framework that enables the Autonomous Annotation of Agent Trajectories in the style of ReAct. The central role is an ActRe prompting agent, which explains the reason for an arbitrary action. When randomly sampling an external action, the ReAct-style agent could query the ActRe agent with the action to obtain its textual rationales. Novel trajectories are then synthesized by prepending the posterior reasoning from ActRe to the sampled action. In this way, the ReAct-style agent executes multiple trajectories for the failed tasks, and selects the successful ones to supplement its failed trajectory for contrastive self-training. Realized by policy gradient methods with binarized rewards, the contrastive self-training with accumulated trajectories facilitates a closed loop for multiple rounds of language agent self-improvement. We conduct experiments using QLoRA fine-tuning with the open-sourced Mistral-7B-Instruct-v0.2. In AlfWorld, the agent trained with A$^3$T obtains a 1-shot success rate of 96%, and 100% success with 4 iterative rounds. In WebShop, the 1-shot performance of the A$^3$T agent matches human average, and 4 rounds of iterative refinement lead to the performance approaching human experts. A$^3$T agents significantly outperform existing techniques, including prompting with GPT-4, advanced agent frameworks, and fully fine-tuned LLMs.
☆ Large Language Models for Multi-Choice Question Classification of Medical Subjects
The aim of this paper is to evaluate whether large language models trained on multi-choice question data can be used to discriminate between medical subjects. This is an important and challenging task for automatic question answering. To achieve this goal, we train deep neural networks for multi-class classification of questions into the inferred medical subjects. Using our Multi-Question (MQ) Sequence-BERT method, we outperform the state-of-the-art results on the MedMCQA dataset with an accuracy of 0.68 and 0.60 on their development and test sets, respectively. In this sense, we show the capability of AI and LLMs in particular for multi-classification tasks in the Healthcare domain.
☆ A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science
This paper explores the use of large language models (LLMs) to score and explain short-answer assessments in K-12 science. While existing methods can score more structured math and computer science assessments, they often do not provide explanations for the scores. Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science, combining few-shot and active learning with chain-of-thought reasoning. Using a human-in-the-loop approach, we successfully score and provide meaningful explanations for formative assessment responses. A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading for open-ended science assessments.
comment: In press at EAAI-24: The 14th Symposium on Educational Advances in Artificial Intelligence
☆ The Era of Semantic Decoding
Recent work demonstrated great promise in the idea of orchestrating collaborations between LLMs, human input, and various tools to address the inherent limitations of LLMs. We propose a novel perspective called semantic decoding, which frames these collaborative processes as optimization procedures in semantic space. Specifically, we conceptualize LLMs as semantic processors that manipulate meaningful pieces of information that we call semantic tokens (known thoughts). LLMs are among a large pool of other semantic processors, including humans and tools, such as search engines or code executors. Collectively, semantic processors engage in dynamic exchanges of semantic tokens to progressively construct high-utility outputs. We refer to these orchestrated interactions among semantic processors, optimizing and searching in semantic space, as semantic decoding algorithms. This concept draws a direct parallel to the well-studied problem of syntactic decoding, which involves crafting algorithms to best exploit auto-regressive language models for extracting high-utility sequences of syntactic tokens. By focusing on the semantic level and disregarding syntactic details, we gain a fresh perspective on the engineering of AI systems, enabling us to imagine systems with much greater complexity and capabilities. In this position paper, we formalize the transition from syntactic to semantic tokens as well as the analogy between syntactic and semantic decoding. Subsequently, we explore the possibilities of optimizing within the space of semantic tokens via semantic decoding algorithms. We conclude with a list of research opportunities and questions arising from this fresh perspective. The semantic decoding perspective offers a powerful abstraction for search and optimization directly in the space of meaningful concepts, with semantic tokens as the fundamental units of a new type of computation.
comment: 25 pages, 3 figures
☆ Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
☆ EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling
Recently, Large Language Models (LLMs) have demonstrated outstanding performance across a wide range of downstream language tasks. Temperature sampling is a commonly used decoding strategy for LLMs' generation process. However, a fixed temperature parameter is used in most cases, which may not always be an optimal choice for balancing generation quality and diversity. In this paper, we propose an effective Entropy-based Dynamic Temperature (EDT) Sampling method, to achieve a more balanced performance in terms of both generation quality and diversity by dynamically selecting the temperature parameter. Additionally, we also show model performance and comprehensive analyses for 4 different generation benchmarks. Our experiments show that EDT significantly outperforms the existing strategies across different tasks.
☆ Building a Language-Learning Game for Brazilian Indigenous Languages: A Case of Study
In this paper we discuss a first attempt to build a language learning game for brazilian indigenous languages and the challenges around it. We present a design for the tool with gamification aspects. Then we describe a process to automatically generate language exercises and questions from a dependency treebank and a lexical database for Tupian languages. We discuss the limitations of our prototype highlighting ethical and practical implementation concerns. Finally, we conclude that new data gathering processes should be established in partnership with indigenous communities and oriented for educational purposes.
comment: First Workshop on NLP for Indigenous Languages of Lusophone Countries, 16th International Conference on Computational Processing of Portuguese (PROPOR 2024)
☆ Detoxifying Large Language Models via Knowledge Editing
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and equips comprehensive metrics for systematic evaluation. We conduct experiments to compare knowledge editing approaches with previous baselines, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance. Then, we propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), to diminish the toxicity of LLMs within a few tuning steps via only one instance. We further provide an in-depth analysis of the internal mechanism for various detoxify approaches, demonstrating that previous methods like SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of the toxic parameters to a certain extent, making permanent adjustments. We hope that these insights could shed light on future work of developing detoxifying approaches and the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.
comment: Ongoing work. Project website: https://zjunlp.github.io/project/SafeEdit Benchmark: https://huggingface.co/datasets/zjunlp/SafeEdit Code: https://github.com/zjunlp/EasyEdit
☆ ChatGPT Alternative Solutions: Large Language Models Survey
In recent times, the grandeur of Large Language Models (LLMs) has not only shone in the realm of natural language processing but has also cast its brilliance across a vast array of applications. This remarkable display of LLM capabilities has ignited a surge in research contributions within this domain, spanning a diverse spectrum of topics. These contributions encompass advancements in neural network architecture, context length enhancements, model alignment, training datasets, benchmarking, efficiency improvements, and more. Recent years have witnessed a dynamic synergy between academia and industry, propelling the field of LLM research to new heights. A notable milestone in this journey is the introduction of ChatGPT, a powerful AI chatbot grounded in LLMs, which has garnered widespread societal attention. The evolving technology of LLMs has begun to reshape the landscape of the entire AI community, promising a revolutionary shift in the way we create and employ AI algorithms. Given this swift-paced technical evolution, our survey embarks on a journey to encapsulate the recent strides made in the world of LLMs. Through an exploration of the background, key discoveries, and prevailing methodologies, we offer an up-to-the-minute review of the literature. By examining multiple LLM models, our paper not only presents a comprehensive overview but also charts a course that identifies existing challenges and points toward potential future research trajectories. This survey furnishes a well-rounded perspective on the current state of generative AI, shedding light on opportunities for further exploration, enhancement, and innovation.
☆ Recourse for reclamation: Chatting with generative language models CHI
Researchers and developers increasingly rely on toxicity scoring to moderate generative language model outputs, in settings such as customer service, information retrieval, and content generation. However, toxicity scoring may render pertinent information inaccessible, rigidify or "value-lock" cultural norms, and prevent language reclamation processes, particularly for marginalized people. In this work, we extend the concept of algorithmic recourse to generative language models: we provide users a novel mechanism to achieve their desired prediction by dynamically setting thresholds for toxicity filtering. Users thereby exercise increased agency relative to interactions with the baseline system. A pilot study ($n = 30$) supports the potential of our proposed recourse mechanism, indicating improvements in usability compared to fixed-threshold toxicity-filtering of model outputs. Future work should explore the intersection of toxicity scoring, model controllability, user agency, and language reclamation processes -- particularly with regard to the bias that many communities encounter when interacting with generative language models.
comment: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA 2024)
☆ Towards Single-System Illusion in Software-Defined Vehicles -- Automated, AI-Powered Workflow
We propose a novel model- and feature-based approach to development of vehicle software systems, where the end architecture is not explicitly defined. Instead, it emerges from an iterative process of search and optimization given certain constraints, requirements and hardware architecture, while retaining the property of single-system illusion, where applications run in a logically uniform environment. One of the key points of the presented approach is the inclusion of modern generative AI, specifically Large Language Models (LLMs), in the loop. With the recent advances in the field, we expect that the LLMs will be able to assist in processing of requirements, generation of formal system models, as well as generation of software deployment specification and test code. The resulting pipeline is automated to a large extent, with feedback being generated at each step.
☆ Multi-Level Explanations for Generative Language Models
Perturbation-based explanation methods such as LIME and SHAP are commonly applied to text classification. This work focuses on their extension to generative language models. To address the challenges of text as output and long text inputs, we propose a general framework called MExGen that can be instantiated with different attribution algorithms. To handle text output, we introduce the notion of scalarizers for mapping text to real numbers and investigate multiple possibilities. To handle long inputs, we take a multi-level approach, proceeding from coarser levels of granularity to finer ones, and focus on algorithms with linear scaling in model queries. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and context-grounded question answering. The results show that our framework can provide more locally faithful explanations of generated outputs.
☆ gTBLS: Generating Tables from Text by Conditional Question Answering
Distilling large, unstructured text into a structured, condensed form such as tables is an open research problem. One of the primary challenges in automatically generating tables is ensuring their syntactic validity. Prior approaches address this challenge by including additional parameters in the Transformer's attention mechanism to attend to specific rows and column headers. In contrast to this single-stage method, this paper presents a two-stage approach called Generative Tables (gTBLS). The first stage infers table structure (row and column headers) from the text. The second stage formulates questions using these headers and fine-tunes a causal language model to answer them. Furthermore, the gTBLS approach is amenable to the utilization of pre-trained Large Language Models in a zero-shot configuration, presenting a solution for table generation in situations where fine-tuning is not feasible. gTBLS improves prior approaches by up to 10% in BERTScore on the table construction task and up to 20% on the table content generation task of the E2E, WikiTableText, WikiBio, and RotoWire datasets.
comment: 12 pages, 1 figure
☆ Prediction of Translation Techniques for the Translation Process
Machine translation (MT) encompasses a variety of methodologies aimed at enhancing the accuracy of translations. In contrast, the process of human-generated translation relies on a wide range of translation techniques, which are crucial for ensuring linguistic adequacy and fluency. This study suggests that these translation techniques could further optimize machine translation if they are automatically identified before being applied to guide the translation process effectively. The study differentiates between two scenarios of the translation process: from-scratch translation and post-editing. For each scenario, a specific set of experiments has been designed to forecast the most appropriate translation techniques. The findings indicate that the predictive accuracy for from-scratch translation reaches 82%, while the post-editing process exhibits even greater potential, achieving an accuracy rate of 93%.
comment: 11 pages, 6 figures, conference
☆ More than Just Statistical Recurrence: Human and Machine Unsupervised Learning of Māori Word Segmentation across Morphological Processes
Non-M\=aori-speaking New Zealanders (NMS)are able to segment M\=aori words in a highlysimilar way to fluent speakers (Panther et al.,2024). This ability is assumed to derive through the identification and extraction of statistically recurrent forms. We examine this assumption by asking how NMS segmentations compare to those produced by Morfessor, an unsupervised machine learning model that operates based on statistical recurrence, across words formed by a variety of morphological processes. Both NMS and Morfessor succeed in segmenting words formed by concatenative processes (compounding and affixation without allomorphy), but NMS also succeed for words that invoke templates (reduplication and allomorphy) and other cues to morphological structure, implying that their learning process is sensitive to more than just statistical recurrence.
comment: 10 pages, 1 Figure, 2 tables
Language Models Can Reduce Asymmetry in Information Markets
This work addresses the buyer's inspection paradox for information markets. The paradox is that buyers need to access information to determine its value, while sellers need to limit access to prevent theft. To study this, we introduce an open-source simulated digital marketplace where intelligent agents, powered by language models, buy and sell information on behalf of external participants. The central mechanism enabling this marketplace is the agents' dual capabilities: they not only have the capacity to assess the quality of privileged information but also come equipped with the ability to forget. This ability to induce amnesia allows vendors to grant temporary access to proprietary information, significantly reducing the risk of unauthorized retention while enabling agents to accurately gauge the information's relevance to specific queries or tasks. To perform well, agents must make rational decisions, strategically explore the marketplace through generated sub-queries, and synthesize answers from purchased information. Concretely, our experiments (a) uncover biases in language models leading to irrational behavior and evaluate techniques to mitigate these biases, (b) investigate how price affects demand in the context of informational goods, and (c) show that inspection and higher budgets both lead to higher quality outcomes.
☆ A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals in an LLM. Using multimodal information yields relative equal-error-rate improvements over text-only and audio-only models of up to 39% and 61%. Increasing the size of the LLM and training with low-rank adaption leads to further relative EER reductions of up to 18% on our dataset.
comment: arXiv admin note: text overlap with arXiv:2312.03632
☆ Emergent communication and learning pressures in language models: a language evolution perspective
Language models and humans are two types of learning systems. Finding or facilitating commonalities could enable major breakthroughs in our understanding of the acquisition and evolution of language. Many theories of language evolution rely heavily on learning biases and learning pressures. Yet due to substantial differences in learning pressures, it is questionable whether the similarity between humans and machines is sufficient for insights to carry over and to be worth testing with human participants. Here, we review the emergent communication literature, a subfield of multi-agent reinforcement learning, from a language evolution perspective. We find that the emergent communication literature excels at designing and adapting models to recover initially absent linguistic phenomena of natural languages. Based on a short literature review, we identify key pressures that have recovered initially absent human patterns in emergent communication models: communicative success, efficiency, learnability, and other psycho-/sociolinguistic factors. We argue that this may serve as inspiration for how to design language models for language acquisition and language evolution research.
comment: 12 pages
☆ Locating and Mitigating Gender Bias in Large Language Models
Large language models(LLM) are pre-trained on extensive corpora to learn facts and human cognition which contain human preferences. However, this process can inadvertently lead to these models acquiring biases and stereotypes prevalent in society. Prior research has typically tackled the issue of bias through a one-dimensional perspective, concentrating either on locating or mitigating it. This limited perspective has created obstacles in facilitating research on bias to synergistically complement and progressively build upon one another. In this study, we integrate the processes of locating and mitigating bias within a unified framework. Initially, we use causal mediation analysis to trace the causal effects of different components' activation within a large language model. Building on this, we propose the LSDM (Least Square Debias Method), a knowledge-editing based method for mitigating gender bias in occupational pronouns, and compare it against two baselines on three gender bias datasets and seven knowledge competency test datasets. The experimental results indicate that the primary contributors to gender bias are the bottom MLP modules acting on the last token of occupational pronouns and the top attention module acting on the final word in the sentence. Furthermore, LSDM mitigates gender bias in the model more effectively than the other baselines, while fully preserving the model's capabilities in all other aspects.
comment: 23 pages, 5 figures
☆ Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity NAACL 2024
Retrieval-Augmented Large Language Models (LLMs), which incorporate the non-parametric knowledge from external knowledge bases into LLMs, have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA). However, even though there are various approaches dealing with queries of different complexities, they either handle simple queries with unnecessary computational overhead or fail to adequately address complex multi-step queries; yet, not all user requests fall into only one of the simple or complex categories. In this work, we propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs from the simplest to the most sophisticated ones based on the query complexity. Also, this selection process is operationalized with a classifier, which is a smaller LM trained to predict the complexity level of incoming queries with automatically collected labels, obtained from actual predicted outcomes of models and inherent inductive biases in datasets. This approach offers a balanced strategy, seamlessly adapting between the iterative and single-step retrieval-augmented LLMs, as well as the no-retrieval methods, in response to a range of query complexities. We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems, compared to relevant baselines including the adaptive retrieval approaches. Code is available at: https://github.com/starsuzi/Adaptive-RAG.
comment: NAACL 2024
☆ XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
☆ Building Accurate Translation-Tailored LLMs with Language Aware Instruction Tuning
Translation-tailored Large language models (LLMs) exhibit remarkable translation capabilities, even competing with supervised-trained commercial translation systems. However, off-target translation remains an unsolved problem, especially for low-resource languages, hindering us from developing accurate LLMs-based translation models. To mitigate the off-target translation problem and enhance the performance of LLMs on translation, recent works have either designed advanced prompting strategies to highlight the functionality of translation instructions or exploited the in-context learning ability of LLMs by feeding few-shot demonstrations. However, these methods essentially do not improve LLM's ability to follow translation instructions, especially the language direction information. In this work, we design a two-stage fine-tuning algorithm to improve the instruction-following ability (especially the translation direction) of LLMs. Specifically, we first tune LLMs with the maximum likelihood estimation loss on the translation dataset to elicit the basic translation capabilities. In the second stage, we construct instruction-conflicting samples by randomly replacing the translation directions with a wrong one within the instruction, and then introduce an extra unlikelihood loss to learn those samples. Experiments on IWSLT and WMT benchmarks upon the LLaMA model spanning 16 zero-shot directions show that, compared to the competitive baseline -- translation-finetuned LLama, our method could effectively reduce the off-target translation ratio (averagely -53.3\%), thus improving translation quality with average +5.7 SacreBLEU and +16.4 BLEURT. Analysis shows that our method could preserve the model's general task performance on AlpacaEval. Code and models will be released at \url{https://github.com/alphadl/LanguageAware_Tuning}.
☆ From Large to Tiny: Distilling and Refining Mathematical Expertise for Math Word Problems with Weakly Supervision
Addressing the challenge of high annotation costs in solving Math Word Problems (MWPs) through full supervision with intermediate equations, recent works have proposed weakly supervised task settings that rely solely on the final answer as a supervised signal. Existing leading approaches typically employ various search techniques to infer intermediate equations, but cannot ensure their semantic consistency with natural language descriptions. The rise of Large Language Models (LLMs) like ChatGPT has opened up new possibilities for addressing MWPs directly. However, the computational demands of LLMs make them less than ideal for use in settings where resources are tight. In light of these challenges, we introduce an innovative two-stage framework that adeptly transfers mathematical Expertise from large to tiny language models. In \emph{Distillation Stage}, we propose a series of extraction processes that satisfy the properties of MWPs to distill mathematical knowledge from LLMs to construct problem-equation pairs required for supervised training. In \emph{Refinement Stage}, Due to Knowledge distilling method cannot guarantee the full utilization of all data, we further utilize the unsuccessfully searched data effectively by Knowledge Refine method. Finally, We train a small model using distilled data generated through two-stage methods. As our method fully leverages the semantic understanding capabilities during the searching 'problem-equation' pair, it demonstrates significantly improved performance on the Math23K and Weak12K datasets compared to existing small model methods, while maintaining a much lower computational cost than ChatGPT.
☆ Editing Knowledge Representation of Language Lodel via Rephrased Prefix Prompts
Neural language models (LMs) have been extensively trained on vast corpora to store factual knowledge about various aspects of the world described in texts. Current technologies typically employ knowledge editing methods or specific prompts to modify LM outputs. However, existing knowledge editing methods are costly and inefficient, struggling to produce appropriate text. Additionally, prompt engineering is opaque and requires significant effort to find suitable prompts. To address these issues, we introduce a new method called PSPEM (Prefix Soft Prompt Editing Method), that can be used for a lifetime with just one training. It resolves the inefficiencies and generalizability issues in knowledge editing methods and overcomes the opacity of prompt engineering by automatically seeking optimal soft prompts. Specifically, PSPEM utilizes a prompt encoder and an encoding converter to refine key information in prompts and uses prompt alignment techniques to guide model generation, ensuring text consistency and adherence to the intended structure and content, thereby maintaining an optimal balance between efficiency and accuracy. We have validated the effectiveness of PSPEM through knowledge editing and attribute inserting. On the COUNTERFACT dataset, PSPEM achieved nearly 100\% editing accuracy and demonstrated the highest level of fluency. We further analyzed the similarities between PSPEM and original prompts and their impact on the model's internals. The results indicate that PSPEM can serve as an alternative to original prompts, supporting the model in effective editing.
comment: 19pages,3figures
☆ FIT-RAG: Black-Box RAG with Factual Information and Token Reduction
Due to the extraordinarily large number of parameters, fine-tuning Large Language Models (LLMs) to update long-tail or out-of-date knowledge is impractical in lots of applications. To avoid fine-tuning, we can alternatively treat a LLM as a black-box (i.e., freeze the parameters of the LLM) and augment it with a Retrieval-Augmented Generation (RAG) system, namely black-box RAG. Recently, black-box RAG has achieved success in knowledge-intensive tasks and has gained much attention. Existing black-box RAG methods typically fine-tune the retriever to cater to LLMs' preferences and concatenate all the retrieved documents as the input, which suffers from two issues: (1) Ignorance of Factual Information. The LLM preferred documents may not contain the factual information for the given question, which can mislead the retriever and hurt the effectiveness of black-box RAG; (2) Waste of Tokens. Simply concatenating all the retrieved documents brings large amounts of unnecessary tokens for LLMs, which degenerates the efficiency of black-box RAG. To address these issues, this paper proposes a novel black-box RAG framework which utilizes the factual information in the retrieval and reduces the number of tokens for augmentation, dubbed FIT-RAG. FIT-RAG utilizes the factual information by constructing a bi-label document scorer. Besides, it reduces the tokens by introducing a self-knowledge recognizer and a sub-document-level token reducer. FIT-RAG achieves both superior effectiveness and efficiency, which is validated by extensive experiments across three open-domain question-answering datasets: TriviaQA, NQ and PopQA. FIT-RAG can improve the answering accuracy of Llama2-13B-Chat by 14.3\% on TriviaQA, 19.9\% on NQ and 27.5\% on PopQA, respectively. Furthermore, it can save approximately half of the tokens on average across the three datasets.
☆ WikiFactDiff: A Large, Realistic, and Temporally Adaptable Dataset for Atomic Factual Knowledge Update in Causal Language Models LREC
The factuality of large language model (LLMs) tends to decay over time since events posterior to their training are "unknown" to them. One way to keep models up-to-date could be factual update: the task of inserting, replacing, or removing certain simple (atomic) facts within the model. To study this task, we present WikiFactDiff, a dataset that describes the evolution of factual knowledge between two dates as a collection of simple facts divided into three categories: new, obsolete, and static. We describe several update scenarios arising from various combinations of these three types of basic update. The facts are represented by subject-relation-object triples; indeed, WikiFactDiff was constructed by comparing the state of the Wikidata knowledge base at 4 January 2021 and 27 February 2023. Those fact are accompanied by verbalization templates and cloze tests that enable running update algorithms and their evaluation metrics. Contrary to other datasets, such as zsRE and CounterFact, WikiFactDiff constitutes a realistic update setting that involves various update scenarios, including replacements, archival, and new entity insertions. We also present an evaluation of existing update algorithms on WikiFactDiff.
comment: Accepted for publication at LREC-COLING 2024
☆ Beyond Surface Similarity: Detecting Subtle Semantic Shifts in Financial Narratives
In this paper, we introduce the Financial-STS task, a financial domain-specific NLP task designed to measure the nuanced semantic similarity between pairs of financial narratives. These narratives originate from the financial statements of the same company but correspond to different periods, such as year-over-year comparisons. Measuring the subtle semantic differences between these paired narratives enables market stakeholders to gauge changes over time in the company's financial and operational situations, which is critical for financial decision-making. We find that existing pretrained embedding models and LLM embeddings fall short in discerning these subtle financial narrative shifts. To address this gap, we propose an LLM-augmented pipeline specifically designed for the Financial-STS task. Evaluation on a human-annotated dataset demonstrates that our proposed method outperforms existing methods trained on classic STS tasks and generic LLM embeddings.
☆ $\nabla τ$: Gradient-based and Task-Agnostic machine Unlearning
Machine Unlearning, the process of selectively eliminating the influence of certain data examples used during a model's training, has gained significant attention as a means for practitioners to comply with recent data protection regulations. However, existing unlearning methods face critical drawbacks, including their prohibitively high cost, often associated with a large number of hyperparameters, and the limitation of forgetting only relatively small data portions. This often makes retraining the model from scratch a quicker and more effective solution. In this study, we introduce Gradient-based and Task-Agnostic machine Unlearning ($\nabla \tau$), an optimization framework designed to remove the influence of a subset of training data efficiently. It applies adaptive gradient ascent to the data to be forgotten while using standard gradient descent for the remaining data. $\nabla \tau$ offers multiple benefits over existing approaches. It enables the unlearning of large sections of the training dataset (up to 30%). It is versatile, supporting various unlearning tasks (such as subset forgetting or class removal) and applicable across different domains (images, text, etc.). Importantly, $\nabla \tau$ requires no hyperparameter adjustments, making it a more appealing option than retraining the model from scratch. We evaluate our framework's effectiveness using a set of well-established Membership Inference Attack metrics, demonstrating up to 10% enhancements in performance compared to state-of-the-art methods without compromising the original model's accuracy.
comment: 14 pages, 2 figures
☆ ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting LREC
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs), establishing itself as a primary approach to solving complex reasoning tasks. Existing CoT synthesis approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts. In response to this challenge, we present an empirical investigation of CoT prompting and introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts. CoTGenius is developed based on three major evolution strategies, i.e., complicate, diversify, and specify-alongside two filtering mechanisms: evolutionary success judgement and correctness verification. We further employ CoTGenius to create an extensive CoT dataset, and subsequently fine-tune the Llama 2-Chat 7B and 13B models on this dataset. We call the resulting model ChainLM. To deal with the cumulative error issue in reasoning steps, we propose a step-level debating method, wherein multiple debaters discuss each reasoning step to arrive at the correct answer. Extensive experiments demonstrate that our ChainLM models exhibit enhanced proficiency in addressing a spectrum of complex reasoning problems compared to existing models. In addition, we conduct an in-depth analysis of the impact of data categories within CoTGenius on the model performance. We release our dataset and code at https://github.com/RUCAIBox/ChainLM.
comment: Accepted to LREC-COLING 2024
☆ Is Reference Necessary in the Evaluation of NLG Systems? When and Where?
The majority of automatic metrics for evaluating NLG systems are reference-based. However, the challenge of collecting human annotation results in a lack of reliable references in numerous application scenarios. Despite recent advancements in reference-free metrics, it has not been well understood when and where they can be used as an alternative to reference-based metrics. In this study, by employing diverse analytical approaches, we comprehensively assess the performance of both metrics across a wide range of NLG tasks, encompassing eight datasets and eight evaluation models. Based on solid experiments, the results show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality. However, their effectiveness varies across tasks and is influenced by the quality of candidate texts. Therefore, it's important to assess the performance of reference-free metrics before applying them to a new task, especially when inputs are in uncommon form or when the answer space is highly variable. Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
☆ Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
☆ LLM-based Extraction of Contradictions from Patents
Already since the 1950s TRIZ shows that patents and the technical contradictions they solve are an important source of inspiration for the development of innovative products. However, TRIZ is a heuristic based on a historic patent analysis and does not make use of the ever-increasing number of latest technological solutions in current patents. Because of the huge number of patents, their length, and, last but not least, their complexity there is a need for modern patent retrieval and patent analysis to go beyond keyword-oriented methods. Recent advances in patent retrieval and analysis mainly focus on dense vectors based on neural AI Transformer language models like Google BERT. They are, for example, used for dense retrieval, question answering or summarization and key concept extraction. A research focus within the methods for patent summarization and key concept extraction are generic inventive concepts respectively TRIZ concepts like problems, solutions, advantage of invention, parameters, and contradictions. Succeeding rule-based approaches, finetuned BERT-like language models for sentence-wise classification represent the state-of-the-art of inventive concept extraction. While they work comparatively well for basic concepts like problems or solutions, contradictions - as a more complex abstraction - remain a challenge for these models. This paper goes one step further, as it presents a method to extract TRIZ contradictions from patent texts based on Prompt Engineering using a generative Large Language Model (LLM), namely OpenAI's GPT-4. Contradiction detection, sentence extraction, contradiction summarization, parameter extraction and assignment to the 39 abstract TRIZ engineering parameters are all performed in a single prompt using the LangChain framework. Our results show that "off-the-shelf" GPT-4 is a serious alternative to existing approaches.
comment: 10 pages, 2 tables
☆ ERD: A Framework for Improving LLM Reasoning for Cognitive Distortion Classification
Improving the accessibility of psychotherapy with the aid of Large Language Models (LLMs) is garnering a significant attention in recent years. Recognizing cognitive distortions from the interviewee's utterances can be an essential part of psychotherapy, especially for cognitive behavioral therapy. In this paper, we propose ERD, which improves LLM-based cognitive distortion classification performance with the aid of additional modules of (1) extracting the parts related to cognitive distortion, and (2) debating the reasoning steps by multiple agents. Our experimental results on a public dataset show that ERD improves the multi-class F1 score as well as binary specificity score. Regarding the latter score, it turns out that our method is effective in debiasing the baseline method which has high false positive rate, especially when the summary of multi-agent debate is provided to LLMs.
☆ K-Act2Emo: Korean Commonsense Knowledge Graph for Indirect Emotional Expression
In many literary texts, emotions are indirectly conveyed through descriptions of actions, facial expressions, and appearances, necessitating emotion inference for narrative understanding. In this paper, we introduce K-Act2Emo, a Korean commonsense knowledge graph (CSKG) comprising 1,900 indirect emotional expressions and the emotions inferable from them. We categorize reasoning types into inferences in positive situations, inferences in negative situations, and inferences when expressions do not serve as emotional cues. Unlike existing CSKGs, K-Act2Emo specializes in emotional contexts, and experimental results validate its effectiveness for training emotion inference models. Significantly, the BART-based knowledge model fine-tuned with K-Act2Emo outperforms various existing Korean large language models, achieving performance levels comparable to GPT-4 Turbo.
comment: 10 pages
☆ LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding LREC
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained significant attention due to their importance. Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure. However, these methods require fine-tuning for each task and dataset, and the models are expensive to train and operate. To overcome this limitation, we propose a new LayoutLLM that integrates these with large-scale language models (LLMs). By leveraging the strengths of existing research in document image understanding and LLMs' superior language understanding capabilities, the proposed model, fine-tuned with multimodal instruction datasets, performs an understanding of document images in a single model. Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
comment: LREC-COLING 2024
☆ Dermacen Analytica: A Novel Methodology Integrating Multi-Modal Large Language Models with Machine Learning in tele-dermatology
The rise of Artificial Intelligence creates great promise in the field of medical discovery, diagnostics and patient management. However, the vast complexity of all medical domains require a more complex approach that combines machine learning algorithms, classifiers, segmentation algorithms and, lately, large language models. In this paper, we describe, implement and assess an Artificial Intelligence-empowered system and methodology aimed at assisting the diagnosis process of skin lesions and other skin conditions within the field of dermatology that aims to holistically address the diagnostic process in this domain. The workflow integrates large language, transformer-based vision models and sophisticated machine learning tools. This holistic approach achieves a nuanced interpretation of dermatological conditions that simulates and facilitates a dermatologist's workflow. We assess our proposed methodology through a thorough cross-model validation technique embedded in an evaluation pipeline that utilizes publicly available medical case studies of skin conditions and relevant images. To quantitatively score the system performance, advanced machine learning and natural language processing tools are employed which focus on similarity comparison and natural language inference. Additionally, we incorporate a human expert evaluation process based on a structured checklist to further validate our results. We implemented the proposed methodology in a system which achieved approximate (weighted) scores of 0.87 for both contextual understanding and diagnostic accuracy, demonstrating the efficacy of our approach in enhancing dermatological analysis. The proposed methodology is expected to prove useful in the development of next-generation tele-dermatology applications, enhancing remote consultation capabilities and access to care, especially in underserved areas.
☆ Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection ACL 2024
Despite the promise of RLHF in aligning LLMs with human preferences, it often leads to superficial alignment, prioritizing stylistic changes over improving downstream performance of LLMs. Underspecified preferences could obscure directions to align the models. Lacking exploration restricts identification of desirable outputs to improve the models. To overcome these challenges, we propose a novel framework: Reinforcement Learning from Reflective Feedback (RLRF), which leverages fine-grained feedback based on detailed criteria to improve the core capabilities of LLMs. RLRF employs a self-reflection mechanism to systematically explore and refine LLM responses, then fine-tuning the models via a RL algorithm along with promising responses. Our experiments across Just-Eval, Factuality, and Mathematical Reasoning demonstrate the efficacy and transformative potential of RLRF beyond superficial surface-level adjustment.
comment: 22 pages, 5 figures, Submitted to ACL 2024
☆ A Unified Framework for Model Editing
Model editing is a growing area focused on updating the knowledge embedded within models. Among the various methodologies, ROME and MEMIT stand out as leading "locate-and-edit" model editing techniques. While MEMIT enables batched editing of memories, ROME is limited to changing one fact at a time. This paper introduces a unifying framework that brings ROME and MEMIT under a single conceptual umbrella, optimizing for the same goal, which we call the "preservation-memorization" objective. This objective aims to preserve the representations of certain selected vectors while memorizing the representations of new factual information. Specifically, ROME optimizes this objective using an equality constraint, whereas MEMIT employs a more flexible least-square constraint. In addition to making batched edits, MEMIT also edits the model at multiple layers. We disentangle the distribution of edits to multiple layers from the optimization objective of MEMIT and show that these edit-distribution algorithms should be considered separate entities worthy of their own line of research. Finally, we present EMMET - an Equality-constrained Mass Model Editing algorithm for Transformers, a new batched memory-editing algorithm. With EMMET, we present a closed form solution for the equality-constrained version of the preservation-memorization objective. We show that EMMET is able to perform batched-edits on par with MEMIT up to a batch-size of 256 and discuss the challenges in stabilizing EMMET. By articulating the "locate-and-edit" model editing algorithms under a simple conceptual framework of "preservation-memorization", we aim to bridge the gap between intuition and mathematics and hope to simplify the journey for future researchers in model editing.
☆ Large-Scale Label Interpretation Learning for Few-Shot Named Entity Recognition
Few-shot named entity recognition (NER) detects named entities within text using only a few annotated examples. One promising line of research is to leverage natural language descriptions of each entity type: the common label PER might, for example, be verbalized as ''person entity.'' In an initial label interpretation learning phase, the model learns to interpret such verbalized descriptions of entity types. In a subsequent few-shot tagset extension phase, this model is then given a description of a previously unseen entity type (such as ''music album'') and optionally a few training examples to perform few-shot NER for this type. In this paper, we systematically explore the impact of a strong semantic prior to interpret verbalizations of new entity types by massively scaling up the number and granularity of entity types used for label interpretation learning. To this end, we leverage an entity linking benchmark to create a dataset with orders of magnitude of more distinct entity types and descriptions as currently used datasets. We find that this increased signal yields strong results in zero- and few-shot NER in in-domain, cross-domain, and even cross-lingual settings. Our findings indicate significant potential for improving few-shot NER through heuristical data-based optimization.
comment: 8 pages
☆ Improving the Robustness of Large Language Models via Consistency Alignment LREC
Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal, as they may generate significantly inconsistent responses due to minor changes in the verbalized instructions. Recent literature has explored this inconsistency issue, highlighting the importance of continued improvement in the robustness of response generation. However, systematic analysis and solutions are still lacking. In this paper, we quantitatively define the inconsistency problem and propose a two-stage training framework consisting of instruction-augmented supervised fine-tuning and consistency alignment training. The first stage helps a model generalize on following instructions via similar instruction augmentations. In the second stage, we improve the diversity and help the model understand which responses are more aligned with human expectations by differentiating subtle differences in similar responses. The training process is accomplished by self-rewards inferred from the trained model at the first stage without referring to external human preference resources. We conduct extensive experiments on recent publicly available LLMs on instruction-following tasks and demonstrate the effectiveness of our training framework.
comment: Accepted by LREC-COLING 2024
☆ Automatic Annotation of Grammaticality in Child-Caregiver Conversations
The acquisition of grammar has been a central question to adjudicate between theories of language acquisition. In order to conduct faster, more reproducible, and larger-scale corpus studies on grammaticality in child-caregiver conversations, tools for automatic annotation can offer an effective alternative to tedious manual annotation. We propose a coding scheme for context-dependent grammaticality in child-caregiver conversations and annotate more than 4,000 utterances from a large corpus of transcribed conversations. Based on these annotations, we train and evaluate a range of NLP models. Our results show that fine-tuned Transformer-based models perform best, achieving human inter-annotation agreement levels.As a first application and sanity check of this tool, we use the trained models to annotate a corpus almost two orders of magnitude larger than the manually annotated data and verify that children's grammaticality shows a steady increase with age.This work contributes to the growing literature on applying state-of-the-art NLP methods to help study child language acquisition at scale.
☆ Context Quality Matters in Training Fusion-in-Decoder for Extractive Open-Domain Question Answering EMNLP
Retrieval-augmented generation models augment knowledge encoded in a language model by providing additional relevant external knowledge (context) during generation. Although it has been shown that the quantity and quality of context impact the performance of retrieval-augmented generation models during inference, limited research explores how these characteristics affect model training. This paper explores how context quantity and quality during model training affect the performance of Fusion-in-Decoder (FiD), the state-of-the-art retrieval-augmented generation model, in extractive open-domain question answering tasks. Experimental results suggest that FiD models overfit to context quality during training and show suboptimal performance when evaluated on different context quality. Through the experimental results, we also reveal FiD models trained with different context quality have different cross-attention distribution patterns. Specifically, as context quality during training increases, FiD models tend to attend more uniformly to each passage in context. Finally, based on these observations, we propose a method to mitigate overfitting to specific context quality by introducing bias to the cross-attention distribution, which we demonstrate to be effective in improving the performance of FiD models on different context quality.
comment: EMNLP Findings 2023
☆ MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation
Automatic detection of multimodal misinformation has gained a widespread attention recently. However, the potential of powerful Large Language Models (LLMs) for multimodal misinformation detection remains underexplored. Besides, how to teach LLMs to interpret multimodal misinformation in cost-effective and accessible way is still an open question. To address that, we propose MMIDR, a framework designed to teach LLMs in providing fluent and high-quality textual explanations for their decision-making process of multimodal misinformation. To convert multimodal misinformation into an appropriate instruction-following format, we present a data augmentation perspective and pipeline. This pipeline consists of a visual information processing module and an evidence retrieval module. Subsequently, we prompt the proprietary LLMs with processed contents to extract rationales for interpreting the authenticity of multimodal misinformation. Furthermore, we design an efficient knowledge distillation approach to distill the capability of proprietary LLMs in explaining multimodal misinformation into open-source LLMs. To explore several research questions regarding the performance of LLMs in multimodal misinformation detection tasks, we construct an instruction-following multimodal misinformation dataset and conduct comprehensive experiments. The experimental findings reveal that our MMIDR exhibits sufficient detection performance and possesses the capacity to provide compelling rationales to support its assessments.
comment: 10 pages, 3 figures
☆ M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset
Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the texts and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations. In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M$^3$AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biology topics. With high-quality human annotations of the spoken and written words, in particular high-valued name entities, the dataset can be used for multiple audio-visual recognition and understanding tasks. Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M$^3$AV makes it a challenging dataset.
☆ C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion ICLR 2024
In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration-a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data.
comment: ICLR 2024
☆ From Handcrafted Features to LLMs: A Brief Survey for Machine Translation Quality Estimation IJCNN 2024
Machine Translation Quality Estimation (MTQE) is the task of estimating the quality of machine-translated text in real time without the need for reference translations, which is of great importance for the development of MT. After two decades of evolution, QE has yielded a wealth of results. This article provides a comprehensive overview of QE datasets, annotation methods, shared tasks, methodologies, challenges, and future research directions. It begins with an introduction to the background and significance of QE, followed by an explanation of the concepts and evaluation metrics for word-level QE, sentence-level QE, document-level QE, and explainable QE. The paper categorizes the methods developed throughout the history of QE into those based on handcrafted features, deep learning, and Large Language Models (LLMs), with a further division of deep learning-based methods into classic deep learning and those incorporating pre-trained language models (LMs). Additionally, the article details the advantages and limitations of each method and offers a straightforward comparison of different approaches. Finally, the paper discusses the current challenges in QE research and provides an outlook on future research directions.
comment: Accepted by IJCNN 2024
☆ A Design Space for Intelligent and Interactive Writing Assistants CHI 2024
In our era of rapid technological advancement, the research landscape for writing assistants has become increasingly fragmented across various research communities. We seek to address this challenge by proposing a design space as a structured way to examine and explore the multidimensional space of intelligent and interactive writing assistants. Through a large community collaboration, we explore five aspects of writing assistants: task, user, technology, interaction, and ecosystem. Within each aspect, we define dimensions (i.e., fundamental components of an aspect) and codes (i.e., potential options for each dimension) by systematically reviewing 115 papers. Our design space aims to offer researchers and designers a practical tool to navigate, comprehend, and compare the various possibilities of writing assistants, and aid in the envisioning and design of new writing assistants.
comment: Published as a conference paper at CHI 2024
☆ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth evaluating the commonsense reasoning ability of large language models (LLMs) in Chinese, which covers both globally known and Chinese-specific commonsense. We evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5 representative prompt strategies for improving LLMs' reasoning ability, such as Chain-of-Thought. Our findings indicate that the LLM's language orientation and the task's domain influence the effectiveness of the prompt strategy, which enriches previous research findings. We built closely-interconnected reasoning and memorization tasks, and found that some LLMs struggle with memorizing Chinese commonsense, affecting their reasoning ability, while others show differences in reasoning despite similar memorization performance. We also evaluated the LLMs' memorization-independent reasoning abilities and analyzed the typical errors. Our study precisely identified the LLMs' strengths and weaknesses, providing the clear direction for optimization. It can also serve as a reference for studies in other fields. We will release CHARM at https://github.com/opendatalab/CHARM .
comment: Equal contribution: Jiaxing Sun, Weiquan Huang, Jiang Wu; Corresponding author: Conghui He
☆ M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval LREC
In recent research, contrastive learning has proven to be a highly effective method for representation learning and is widely used for dense retrieval. However, we identify that relying solely on contrastive learning can lead to suboptimal retrieval performance. On the other hand, despite many retrieval datasets supporting various learning objectives beyond contrastive learning, combining them efficiently in multi-task learning scenarios can be challenging. In this paper, we introduce M3, an advanced recursive Multi-hop dense sentence retrieval system built upon a novel Multi-task Mixed-objective approach for dense text representation learning, addressing the aforementioned challenges. Our approach yields state-of-the-art performance on a large-scale open-domain fact verification benchmark dataset, FEVER. Code and data are available at: https://github.com/TonyBY/M3
comment: Accepted by LREC-COLING 2024
☆ A Taxonomy of Ambiguity Types for NLP EACL 2024
Ambiguity is an critical component of language that allows for more effective communication between speakers, but is often ignored in NLP. Recent work suggests that NLP systems may struggle to grasp certain elements of human language understanding because they may not handle ambiguities at the level that humans naturally do in communication. Additionally, different types of ambiguity may serve different purposes and require different approaches for resolution, and we aim to investigate how language models' abilities vary across types. We propose a taxonomy of ambiguity types as seen in English to facilitate NLP analysis. Our taxonomy can help make meaningful splits in language ambiguity data, allowing for more fine-grained assessments of both datasets and model performance.
comment: To appear at the UnImplicit workshop at EACL 2024
☆ The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data
The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains. There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared to other ML areas, e.g., computer vision or natural language processing. A major limitation with audio is the available data; with audio being a time-dependent modality, high-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to a larger, more generalizable dataset. In this short white paper, to encourage researchers with limited access to large-datasets, the organizers first outline several open-source datasets that are available to the community, and for the duration of the workshop are making several propriety datasets available. Namely, three vocal datasets, Hume-Prosody, Hume-VocalBurst, an acted emotional speech dataset Modulate-Sonata, and an in-game streamer dataset Modulate-Stream. We outline the current baselines on these datasets but encourage researchers from across audio to utilize them outside of the initial baseline tasks.
☆ DreamReward: Text-to-3D Generation with Human Preference
3D content creation from text prompts has shown remarkable success recently. However, current text-to-3D methods often generate 3D results that do not align well with human preferences. In this paper, we present a comprehensive framework, coined DreamReward, to learn and improve text-to-3D models from human preference feedback. To begin with, we collect 25k expert comparisons based on a systematic annotation pipeline including rating and ranking. Then, we build Reward3D -- the first general-purpose text-to-3D human preference reward model to effectively encode human preferences. Building upon the 3D reward model, we finally perform theoretical analysis and present the Reward3D Feedback Learning (DreamFL), a direct tuning algorithm to optimize the multi-view diffusion models with a redefined scorer. Grounded by theoretical proof and extensive experiment comparisons, our DreamReward successfully generates high-fidelity and 3D consistent results with significant boosts in prompt alignment with human intention. Our results demonstrate the great potential for learning from human feedback to improve text-to-3D models.
comment: Project page: https://jamesyjl.github.io/DreamReward
☆ Text-Enhanced Data-free Approach for Federated Class-Incremental Learning CVPR 2024
Federated Class-Incremental Learning (FCIL) is an underexplored yet pivotal issue, involving the dynamic addition of new classes in the context of federated learning. In this field, Data-Free Knowledge Transfer (DFKT) plays a crucial role in addressing catastrophic forgetting and data privacy problems. However, prior approaches lack the crucial synergy between DFKT and the model training phases, causing DFKT to encounter difficulties in generating high-quality data from a non-anchored latent space of the old task model. In this paper, we introduce LANDER (Label Text Centered Data-Free Knowledge Transfer) to address this issue by utilizing label text embeddings (LTE) produced by pretrained language models. Specifically, during the model training phase, our approach treats LTE as anchor points and constrains the feature embeddings of corresponding training samples around them, enriching the surrounding area with more meaningful information. In the DFKT phase, by using these LTE anchors, LANDER can synthesize more meaningful samples, thereby effectively addressing the forgetting problem. Additionally, instead of tightly constraining embeddings toward the anchor, the Bounding Loss is introduced to encourage sample embeddings to remain flexible within a defined radius. This approach preserves the natural differences in sample embeddings and mitigates the embedding overlap caused by heterogeneous federated settings. Extensive experiments conducted on CIFAR100, Tiny-ImageNet, and ImageNet demonstrate that LANDER significantly outperforms previous methods and achieves state-of-the-art performance in FCIL. The code is available at https://github.com/tmtuan1307/lander.
comment: Accepted at CVPR 2024
☆ Extracting Emotion Phrases from Tweets using BART
Sentiment analysis is a natural language processing task that aims to identify and extract the emotional aspects of a text. However, many existing sentiment analysis methods primarily classify the overall polarity of a text, overlooking the specific phrases that convey sentiment. In this paper, we applied an approach to sentiment analysis based on a question-answering framework. Our approach leverages the power of Bidirectional Autoregressive Transformer (BART), a pre-trained sequence-to-sequence model, to extract a phrase from a given text that amplifies a given sentiment polarity. We create a natural language question that identifies the specific emotion to extract and then guide BART to pay attention to the relevant emotional cues in the text. We use a classifier within BART to predict the start and end positions of the answer span within the text, which helps to identify the precise boundaries of the extracted emotion phrase. Our approach offers several advantages over most sentiment analysis studies, including capturing the complete context and meaning of the text and extracting precise token spans that highlight the intended sentiment. We achieved an end loss of 87% and Jaccard score of 0.61.
☆ AutoRE: Document-Level Relation Extraction with Large Language Models
Large Language Models (LLMs) have demonstrated exceptional abilities in comprehending and generating text, motivating numerous researchers to utilize them for Information Extraction (IE) purposes, including Relation Extraction (RE). Nonetheless, most existing methods are predominantly designed for Sentence-level Relation Extraction (SentRE) tasks, which typically encompass a restricted set of relations and triplet facts within a single sentence. Furthermore, certain approaches resort to treating relations as candidate choices integrated into prompt templates, leading to inefficient processing and suboptimal performance when tackling Document-Level Relation Extraction (DocRE) tasks, which entail handling multiple relations and triplet facts distributed across a given document, posing distinct challenges. To overcome these limitations, we introduce AutoRE, an end-to-end DocRE model that adopts a novel RE extraction paradigm named RHF (Relation-Head-Facts). Unlike existing approaches, AutoRE does not rely on the assumption of known relation options, making it more reflective of real-world scenarios. Additionally, we have developed an easily extensible RE framework using a Parameters Efficient Fine Tuning (PEFT) algorithm (QLoRA). Our experiments on the RE-DocRED dataset showcase AutoRE's best performance, achieving state-of-the-art results, surpassing TAG by 10.03% and 9.03% respectively on the dev and test set.
comment: 11 pages
☆ VidLA: Video-Language Alignment at Scale CVPR 2024
In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome it, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets which only contain short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.
comment: Accepted to CVPR 2024
☆ Comparing Plausibility Estimates in Base and Instruction-Tuned Large Language Models
Instruction-tuned LLMs can respond to explicit queries formulated as prompts, which greatly facilitates interaction with human users. However, prompt-based approaches might not always be able to tap into the wealth of implicit knowledge acquired by LLMs during pre-training. This paper presents a comprehensive study of ways to evaluate semantic plausibility in LLMs. We compare base and instruction-tuned LLM performance on an English sentence plausibility task via (a) explicit prompting and (b) implicit estimation via direct readout of the probabilities models assign to strings. Experiment 1 shows that, across model architectures and plausibility datasets, (i) log likelihood ($\textit{LL}$) scores are the most reliable indicator of sentence plausibility, with zero-shot prompting yielding inconsistent and typically poor results; (ii) $\textit{LL}$-based performance is still inferior to human performance; (iii) instruction-tuned models have worse $\textit{LL}$-based performance than base models. In Experiment 2, we show that $\textit{LL}$ scores across models are modulated by context in the expected way, showing high performance on three metrics of context-sensitive plausibility and providing a direct match to explicit human plausibility judgments. Overall, $\textit{LL}$ estimates remain a more reliable measure of plausibility in LLMs than direct prompting.
☆ TAMS: Translation-Assisted Morphological Segmentation ACL
Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes. This is a core task in language documentation, and NLP systems have the potential to dramatically speed up this process. But in typical language documentation settings, training data for canonical morpheme segmentation is scarce, making it difficult to train high quality models. However, translation data is often much more abundant, and, in this work, we present a method that attempts to leverage this data in the canonical segmentation task. We propose a character-level sequence-to-sequence model that incorporates representations of translations obtained from pretrained high-resource monolingual language models as an additional signal. Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data. While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
comment: Submitted to ACL ARR on December 15th 2023
☆ The opportunities and risks of large language models in mental health
Global rates of mental health concerns are rising and there is increasing realization that existing models of mental healthcare will not adequately expand to meet the demand. With the emergence of large language models (LLMs) has come great optimism regarding their promise to create novel, large-scale solutions to support mental health. Despite their nascence, LLMs have already been applied to mental health-related tasks. In this review, we summarize the extant literature on efforts to use LLMs to provide mental health education, assessment, and intervention and highlight key opportunities for positive impact in each area. We then highlight risks associated with LLMs application to mental health and encourage adoption of strategies to mitigate these risks. The urgent need for mental health support must be balanced with responsible development, testing, and deployment of mental health LLMs. Especially critical is ensuring that mental health LLMs are fine-tuned for mental health, enhance mental health equity, adhere to ethical standards, and that people, including those with lived experience with mental health concerns, are involved in all stages from development through deployment. Prioritizing these efforts will minimize potential harms to mental health and maximize the likelihood that LLMs will positively impact mental health globally.
comment: 12 pages, 2 tables, 4 figures
☆ A Collection of Pragmatic-Similarity Judgments over Spoken Dialog Utterances LREC 2024
Automatic measures of similarity between utterances are invaluable for training speech synthesizers, evaluating machine translation, and assessing learner productions. While there exist measures for semantic similarity and prosodic similarity, there are as yet none for pragmatic similarity. To enable the training of such measures, we developed the first collection of human judgments of pragmatic similarity between utterance pairs. Each pair consisting of an utterance extracted from a recorded dialog and a re-enactment of that utterance. Re-enactments were done under various conditions designed to create a variety of degrees of similarity. Each pair was rated on a continuous scale by 6 to 9 judges. The average inter-judge correlation was as high as 0.72 for English and 0.66 for Spanish. We make this data available at https://github.com/divettemarco/PragSim .
comment: LREC 2024
☆ Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering
This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study focuses on the system's performance without fine-tuning it on specific VQA datasets, making it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research.
comment: A full version of the paper will be released soon. The codes are available at https://github.com/bowen-upenn/Multi-Agent-VQA
☆ Few-Shot Adversarial Prompt Learning on Vision-Language Models
The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention. Inspired by the success of vision-language foundation models, previous efforts achieved zero-shot adversarial robustness by aligning adversarial visual features with text supervision. However, in practice, they are still unsatisfactory due to several issues, including heavy adaptation cost, suboptimal text supervision, and uncontrolled natural generalization capacity. In this paper, to address these issues, we propose a few-shot adversarial prompt framework where adapting input sequences with limited data makes significant adversarial robustness improvement. Specifically, we achieve this by providing adversarially correlated text supervision that is end-to-end learned from adversarial examples. We also propose a novel training objective that enhances the consistency of multi-modal features while encourages differentiated uni-modal features between natural and adversarial examples. The proposed framework gives access to learn adversarial text supervision, which provides superior cross-modal adversarial alignment and matches state-of-the-art zero-shot adversarial robustness with only 1% training data.
comment: 25 pages, 13 tables, 8 figures
☆ StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video generation (typically 16 or 24 frames), ending up with hard-cuts when naively extended to the case of long video synthesis. To overcome these limitations, we introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions. The key components are:(i) a short-term memory block called conditional attention module (CAM), which conditions the current generation on the features extracted from the previous chunk via an attentional mechanism, leading to consistent chunk transitions, (ii) a long-term memory block called appearance preservation module, which extracts high-level scene and object features from the first video chunk to prevent the model from forgetting the initial scene, and (iii) a randomized blending approach that enables to apply a video enhancer autoregressively for infinitely long videos without inconsistencies between chunks. Experiments show that StreamingT2V generates high motion amount. In contrast, all competing image-to-video methods are prone to video stagnation when applied naively in an autoregressive manner. Thus, we propose with StreamingT2V a high-quality seamless text-to-long video generator that outperforms competitors with consistency and motion. Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V
comment: https://github.com/Picsart-AI-Research/StreamingT2V
☆ A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond
Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize code -- holds immense potential for transformative impacts on the whole society. Bridging the gap between Natural Language and Programming Language, this domain has drawn significant attention from researchers in both research communities over the past few years. This survey presents a systematic and chronological review of the advancements in code intelligence, encompassing over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. We follow the historical progression to trace the paradigm shifts across different research phases (e.g., from modeling code with recurrent neural networks to the era of Large Language Models). Concurrently, we highlight the major technical transitions in models, tasks, and evaluations spanning through different stages. For applications, we also observe a co-evolving shift. It spans from initial endeavors to tackling specific scenarios, through exploring a diverse array of tasks during its rapid expansion, to currently focusing on tackling increasingly complex and varied real-world challenges. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains. Finally, we delve into both the opportunities and challenges associated with this field, alongside elucidating our insights on the most promising research directions. An ongoing, dynamically updated project and resources associated with this survey have been released at https://github.com/QiushiSun/NCISurvey.
comment: 64 pages, 6 figures, 10 tables, 688 references
☆ Open Knowledge Base Canonicalization with Multi-task Learning
The construction of large open knowledge bases (OKBs) is integral to many knowledge-driven applications on the world wide web such as web search. However, noun phrases and relational phrases in OKBs often suffer from redundancy and ambiguity, which calls for the investigation on OKB canonicalization. Current solutions address OKB canonicalization by devising advanced clustering algorithms and using knowledge graph embedding (KGE) to further facilitate the canonicalization process. Nevertheless, these works fail to fully exploit the synergy between clustering and KGE learning, and the methods designed for these subtasks are sub-optimal. To this end, we put forward a multi-task learning framework, namely MulCanon, to tackle OKB canonicalization. In addition, diffusion model is used in the soft clustering process to improve the noun phrase representations with neighboring information, which can lead to more accurate representations. MulCanon unifies the learning objectives of these sub-tasks, and adopts a two-stage multi-task learning paradigm for training. A thorough experimental study on popular OKB canonicalization benchmarks validates that MulCanon can achieve competitive canonicalization results.
comment: arXiv admin note: substantial text overlap with arXiv:2310.16419
♻ ☆ TableLlama: Towards Open Large Generalist Models for Tables NAACL 2024
Semi-structured tables are ubiquitous. There has been a variety of tasks that aim to automatically interpret, augment, and query tables. Current methods often require pretraining on tables or special model architecture design, are restricted to specific table types, or have simplifying assumptions about tables and tasks. This paper makes the first step towards developing open-source large language models (LLMs) as generalists for a diversity of table-based tasks. Towards that end, we construct TableInstruct, a new dataset with a variety of realistic tables and tasks, for instruction tuning and evaluating LLMs. We further develop the first open-source generalist model for tables, TableLlama, by fine-tuning Llama 2 (7B) with LongLoRA to address the long context challenge. We experiment under both in-domain setting and out-of-domain setting. On 7 out of 8 in-domain tasks, TableLlama achieves comparable or better performance than the SOTA for each task, despite the latter often has task-specific design. On 6 out-of-domain datasets, it achieves 5-44 absolute point gains compared with the base model, showing that training on TableInstruct enhances the model's generalizability. We open-source our dataset and trained model to boost future work on developing open generalist models for tables.
comment: NAACL 2024 long paper
♻ ☆ m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and Github (https://github.com/RAIVNLab/mnms).
♻ ☆ Unraveling the Mystery of Scaling Laws: Part I
Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training. These principles play a vital role in optimizing various aspects of model pre-training, ultimately contributing to the success of large language models such as GPT-4, Llama and Gemini. However, the original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas, and their conclusions are only based on models containing up to 1.5 billion parameters. Though some subsequent works attempt to unveil these details and scale to larger models, they often neglect the training dependency of important factors such as the learning rate, context length and batch size, leading to their failure to establish a reliable formula for predicting the test loss trajectory. In this technical report, we confirm that the scaling law formulations proposed in the original OpenAI paper remain valid when scaling the model size up to 33 billion, but the constant coefficients in these formulas vary significantly with the experiment setup. We meticulously identify influential factors and provide transparent, step-by-step instructions to estimate all constant terms in scaling-law formulas by training on models with only 1M~60M parameters. Using these estimated formulas, we showcase the capability to accurately predict various attributes for models with up to 33B parameters before their training, including (1) the minimum possible test loss; (2) the minimum required training steps and processed tokens to achieve a specific loss; (3) the critical batch size with an optimal time/computation trade-off at any loss value; and (4) the complete test loss trajectory with arbitrary batch size.
♻ ☆ EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models
In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard open-source instruction processing implementation framework available for the community, which hinders practitioners from further developing and advancing. To facilitate instruction processing research and development, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. EasyInstruct is publicly released and actively maintained at https://github.com/zjunlp/EasyInstruct, along with an online demo app and a demo video for quick-start, calling for broader research centered on instruction data and synthetic data.
comment: Project website: https://zjunlp.github.io/project/EasyInstruct Code: https://github.com/zjunlp/EasyInstruct Video: https://youtu.be/rfQOWYfziFo Demo: https://huggingface.co/spaces/zjunlp/EasyInstruct
♻ ☆ SignBank+: Preparing a Multilingual Sign Language Dataset for Machine Translation Using Large Language Models
We introduce SignBank+, a clean version of the SignBank dataset, optimized for machine translation between spoken language text and SignWriting, a phonetic sign language writing system. In addition to previous work that employs complex factorization techniques to enable translation between text and SignWriting, we show that a traditional text-to-text translation approach performs equally effectively on the cleaned SignBank+ dataset. Our evaluation results indicate that models trained on SignBank+ surpass those on the original dataset, establishing a new benchmark for SignWriting-based sign language translation and providing an open resource for future research.
♻ ☆ Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.
♻ ☆ Sequence-to-Sequence Spanish Pre-trained Language Models LREC
In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.
comment: Accepted paper at LREC-Coling2024
♻ ☆ Effective Structured Prompting by Meta-Learning and Representative Verbalizer ICML 2023
Prompt tuning for pre-trained masked language models (MLM) has shown promising performance in natural language processing tasks with few labeled examples. It tunes a prompt for the downstream task, and a verbalizer is used to bridge the predicted token and label prediction. Due to the limited training data, prompt initialization is crucial for prompt tuning. Recently, MetaPrompting (Hou et al., 2022) uses meta-learning to learn a shared initialization for all task-specific prompts. However, a single initialization is insufficient to obtain good prompts for all tasks and samples when the tasks are complex. Moreover, MetaPrompting requires tuning the whole MLM, causing a heavy burden on computation and memory as the MLM is usually large. To address these issues, we use a prompt pool to extract more task knowledge and construct instance-dependent prompts via attention. We further propose a novel soft verbalizer (RepVerb) which constructs label embedding from feature embeddings directly. Combining meta-learning the prompt pool and RepVerb, we propose MetaPrompter for effective structured prompting. MetaPrompter is parameter-efficient as only the pool is required to be tuned. Experimental results demonstrate that MetaPrompter performs better than the recent state-of-the-arts and RepVerb outperforms existing soft verbalizers.
comment: Accepted at ICML 2023
♻ ☆ DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding a few thousand parameters for large-scale models in the 100B parameters range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations -- we refer to this operation as Depth-Weighted-Average (DWA). The learned DWA weights exhibit coherent patterns of information flow, revealing the strong and structured reuse of activations from distant layers. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity of much deeper transformer models, and that for the same perplexity, these new models outperform transformer baselines in terms of memory efficiency and inference time.
♻ ☆ Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method NAACL 2024
Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks. However, recent literature reveals that LLMs generate nonfactual responses intermittently, which impedes the LLMs' reliability for further utilization. In this paper, we propose a novel self-detection method to detect which questions that a LLM does not know that are prone to generate nonfactual results. Specifically, we first diversify the textual expressions for a given question and collect the corresponding answers. Then we examine the divergencies between the generated answers to identify the questions that the model may generate falsehoods. All of the above steps can be accomplished by prompting the LLMs themselves without referring to any other external resources. We conduct comprehensive experiments and demonstrate the effectiveness of our method on recently released LLMs, e.g., Vicuna, ChatGPT, and GPT-4.
comment: Accepted by NAACL 2024
♻ ☆ NewsBench: Systematic Evaluation of LLMs for Writing Proficiency and Safety Adherence in Chinese Journalistic Editorial Applications
This study presents NewsBench, a novel benchmark framework developed to evaluate the capability of Large Language Models (LLMs) in Chinese Journalistic Writing Proficiency (JWP) and their Safety Adherence (SA), addressing the gap between journalistic ethics and the risks associated with AI utilization. Comprising 1,267 tasks across 5 editorial applications, 7 aspects (including safety and journalistic writing with 4 detailed facets), and spanning 24 news topics domains, NewsBench employs two GPT-4 based automatic evaluation protocols validated by human assessment. Our comprehensive analysis of 10 LLMs highlighted GPT-4 and ERNIE Bot as top performers, yet revealed a relative deficiency in journalistic ethic adherence during creative writing tasks. These findings underscore the need for enhanced ethical guidance in AI-generated journalistic content, marking a step forward in aligning AI capabilities with journalistic standards and safety considerations.
comment: 27 pages
♻ ☆ Pluggable Neural Machine Translation Models via Memory-augmented Adapters LREC
Although neural machine translation (NMT) models perform well in the general domain, it remains rather challenging to control their generation behavior to satisfy the requirement of different users. Given the expensive training cost and the data scarcity challenge of learning a new model from scratch for each user requirement, we propose a memory-augmented adapter to steer pretrained NMT models in a pluggable manner. Specifically, we construct a multi-granular memory based on the user-provided text samples and propose a new adapter architecture to combine the model representations and the retrieved results. We also propose a training strategy using memory dropout to reduce spurious dependencies between the NMT model and the memory. We validate our approach on both style- and domain-specific experiments and the results indicate that our method can outperform several representative pluggable baselines.
comment: Accepted by LREC-COLING 2024
♻ ☆ Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation
Existing technologies expand BERT from different perspectives, e.g. designing different pre-training tasks, different semantic granularities, and different model architectures. Few models consider expanding BERT from different text formats. In this paper, we propose a heterogeneous knowledge language model (\textbf{HKLM}), a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text. To capture the corresponding relations among these multi-format knowledge, our approach uses masked language model objective to learn word knowledge, uses triple classification objective and title matching objective to learn entity knowledge and topic knowledge respectively. To obtain the aforementioned multi-format text, we construct a corpus in the tourism domain and conduct experiments on 5 tourism NLP datasets. The results show that our approach outperforms the pre-training of plain text using only 1/4 of the data. We further pre-train the domain-agnostic HKLM and achieve performance gains on the XNLI dataset.
♻ ☆ Reranking Passages with Coarse-to-Fine Neural Retriever Enhanced by List-Context Information
Passage reranking is a critical task in various applications, particularly when dealing with large volumes of documents. Existing neural architectures have limitations in retrieving the most relevant passage for a given question because the semantics of the segmented passages are often incomplete, and they typically match the question to each passage individually, rarely considering contextual information from other passages that could provide comparative and reference information. This paper presents a list-context attention mechanism to augment the passage representation by incorporating the list-context information from other candidates. The proposed coarse-to-fine (C2F) neural retriever addresses the out-of-memory limitation of the passage attention mechanism by dividing the list-context modeling process into two sub-processes with a cache policy learning algorithm, enabling the efficient encoding of context information from a large number of candidate answers. This method can be generally used to encode context information from any number of candidate answers in one pass. Different from most multi-stage information retrieval architectures, this model integrates the coarse and fine rankers into the joint optimization process, allowing for feedback between the two layers to update the model simultaneously. Experiments demonstrate the effectiveness of the proposed approach.
♻ ☆ LLMs and the Human Condition
This paper presents three established theories of human decision-making and describes how they can be integrated to provide a model of purposive human action. Taking seriously the idea of language as action the model is then applied to the conversational user interfaces. Theory based AI research has had a hard time recently and the aim here is to revitalise interest in understanding what LLMs are actually doing other than running poorly understood machine learning routines over all the data the relevant Big Tech company can hoover up. When a raspberry pi computer for under 50USD is up to 400 times faster than the first commercial Cray super computer~\cite{crayVpi}, Big Tech can get really close to having an infinite number of monkeys typing at random and producing text, some of which will make sense. By understanding where ChatGPT's apparent intelligence comes from, perhaps we can perform the magic with fewer resources and at the same time gain some understanding about our relationship with our world.
comment: A 2nd draft with a better abstract and introduction. target is IVA in 2024
♻ ☆ Detecting Sexual Content at the Sentence Level in First Millennium Latin Texts
In this study, we propose to evaluate the use of deep learning methods for semantic classification at the sentence level to accelerate the process of corpus building in the field of humanities and linguistics, a traditional and time-consuming task. We introduce a novel corpus comprising around 2500 sentences spanning from 300 BCE to 900 CE including sexual semantics (medical, erotica, etc.). We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (centuries, author, type of writing), but find that it leads to overfitting. Our results demonstrate the effectiveness of this approach, achieving high precision and true positive rates (TPR) of respectively 70.60% and 86.33% using HAN. We evaluate the impact of the dataset size on the model performances (420 instead of 2013), and show that, while our models perform worse, they still offer a high enough precision and TPR, even without MLM, respectively 69% and 51%. Given the result, we provide an analysis of the attention mechanism as a supporting added value for humanists in order to produce more data.
♻ ☆ LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks. However, it requires non-trivial efforts to implement these methods on different models. We present LlamaFactory, a unified framework that integrates a suite of cutting-edge efficient training methods. It allows users to flexibly customize the fine-tuning of 100+ LLMs without the need for coding through the built-in web UI LlamaBoard. We empirically validate the efficiency and effectiveness of our framework on language modeling and text generation tasks. It has been released at https://github.com/hiyouga/LLaMA-Factory and already received over 13,000 stars and 1,600 forks.
comment: 12 pages, preprint
♻ ☆ Evaluating, Understanding, and Improving Constrained Text Generation for Large Language Models
Advancements in natural language generation (NLG) and large language models (LLMs) have led to proficient text generation in various tasks. However, integrating intricate constraints into neural text generation, due to LLMs' opacity, remains challenging. This study investigates constrained text generation for LLMs, where predefined constraints are applied during LLM's generation process. Our research mainly focuses on mainstream open-source LLMs, categorizing constraints into lexical, structural, and relation-based types. We also present various benchmarks to facilitate fair evaluation. The study addresses some key research questions, including evaluating, understanding and improving constrained text generation for LLMs. Results illuminate LLMs' capacity and deficiency to incorporate constraints and provide insights for future developments in constrained text generation. Codes and datasets will be released upon acceptance.
comment: Work in progress
♻ ☆ Generating Explanations to Understand and Repair Embedding-based Entity Alignment ICDE 2024
Entity alignment (EA) seeks identical entities in different knowledge graphs, which is a long-standing task in the database research. Recent work leverages deep learning to embed entities in vector space and align them via nearest neighbor search. Although embedding-based EA has gained marked success in recent years, it lacks explanations for alignment decisions. In this paper, we present the first framework that can generate explanations for understanding and repairing embedding-based EA results. Given an EA pair produced by an embedding model, we first compare its neighbor entities and relations to build a matching subgraph as a local explanation. We then construct an alignment dependency graph to understand the pair from an abstract perspective. Finally, we repair the pair by resolving three types of alignment conflicts based on dependency graphs. Experiments on a variety of EA datasets demonstrate the effectiveness, generalization, and robustness of our framework in explaining and repairing embedding-based EA results.
comment: Accepted in the 40th IEEE International Conference on Data Engineering (ICDE 2024)
♻ ☆ Less is More: Data Value Estimation for Visual Instruction Tuning
Visual instruction tuning is the key to building multimodal large language models (MLLMs), which greatly improves the reasoning capabilities of large language models (LLMs) in vision scenario. However, existing MLLMs mostly rely on a mixture of multiple highly diverse visual instruction datasets for training (even more than a million instructions), which may introduce data redundancy. To investigate this issue, we conduct a series of empirical studies, which reveal a significant redundancy within the visual instruction datasets, and show that greatly reducing the amount of several instruction dataset even do not affect the performance. Based on the findings, we propose a new data selection approach TIVE, to eliminate redundancy within visual instruction data. TIVE first estimates the task-level and instance-level value of the visual instructions based on computed gradients. Then, according to the estimated values, TIVE determines the task proportion within the visual instructions, and selects representative instances to compose a smaller visual instruction subset for training. Experiments on LLaVA-1.5 show that our approach using only about 7.5% data can achieve comparable performance as the full-data fine-tuned model across seven benchmarks, even surpassing it on four of the benchmarks. Our code and data will be publicly released.
♻ ☆ A Challenge Dataset and Effective Models for Conversational Stance Detection
Previous stance detection studies typically concentrate on evaluating stances within individual instances, thereby exhibiting limitations in effectively modeling multi-party discussions concerning the same specific topic, as naturally transpire in authentic social media interactions. This constraint arises primarily due to the scarcity of datasets that authentically replicate real social media contexts, hindering the research progress of conversational stance detection. In this paper, we introduce a new multi-turn conversation stance detection dataset (called \textbf{MT-CSD}), which encompasses multiple targets for conversational stance detection. To derive stances from this challenging dataset, we propose a global-local attention network (\textbf{GLAN}) to address both long and short-range dependencies inherent in conversational data. Notably, even state-of-the-art stance detection methods, exemplified by GLAN, exhibit an accuracy of only 50.47\%, highlighting the persistent challenges in conversational stance detection. Furthermore, our MT-CSD dataset serves as a valuable resource to catalyze advancements in cross-domain stance detection, where a classifier is adapted from a different yet related target. We believe that MT-CSD will contribute to advancing real-world applications of stance detection research. Our source code, data, and models are available at \url{https://github.com/nfq729/MT-CSD}.
♻ ☆ RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners LREC
Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Traditional approaches to mitigate these errors involve human or tool-based feedback, such as employing task-specific verifiers or aggregating multiple reasoning paths. These methods, however, either depend heavily on human input or struggle with inconsistent responses. To overcome these limitations, we present RankPrompt, an innovative prompting strategy that empowers LLMs to autonomously rank their responses without needing extra resources. RankPrompt simplifies the ranking challenge into comparative evaluations among different responses, leveraging LLMs' innate ability to generate comparative examples within context. Our experiments across 11 arithmetic and commonsense reasoning tasks show that RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4, with improvements of up to 13%. Furthermore, RankPrompt shows exceptional performance in LLM-based automatic evaluations for open-ended tasks, matching human judgments 74% of the time in the AlpacaEval dataset. It also proves to be robust against changes in response order and inconsistency. Overall, our findings endorse RankPrompt as an effective method for extracting high-quality feedback directly from language models.
comment: LREC-Coling 2024 Long Paper
♻ ☆ ANLS* -- A Universal Document Processing Metric for Generative Large Language Models
Traditionally, discriminative models have been the predominant choice for tasks like document classification and information extraction. These models make predictions that fall into a limited number of predefined classes, facilitating a binary true or false evaluation and enabling the direct calculation of metrics such as the F1 score. However, recent advancements in generative large language models (GLLMs) have prompted a shift in the field due to their enhanced zero-shot capabilities, which eliminate the need for a downstream dataset and computationally expensive fine-tuning. However, evaluating GLLMs presents a challenge as the binary true or false evaluation used for discriminative models is not applicable to the predictions made by GLLMs. This paper introduces a new metric for generative models called ANLS* for evaluating a wide variety of tasks, including information extraction and classification tasks. The ANLS* metric extends existing ANLS metrics as a drop-in-replacement and is still compatible with previously reported ANLS scores. An evaluation of 7 different datasets, 6 different GLLMs and 3 different prompting methods using the ANLS* metric is also provided, demonstrating the importance of the proposed metric. We also benchmark a novel approach to generate prompts for documents, called SFT, against other prompting techniques such as LATIN. In 27 out of 35 cases, SFT outperforms other techniques and improves the state-of-the-art, sometimes by as much as $18$ percentage points. Sources are available at https://github.com/deepopinion/anls_star_metric
♻ ☆ TiC-CLIP: Continual Training of CLIP Models ICLR 2024
Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models. We show OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021-2022 compared with more recently trained models in OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ when compared to the standard practice of retraining from scratch. Code is available at https://github.com/apple/ml-tic-clip.
comment: ICLR 2024
♻ ☆ Uncertainty-Aware Relational Graph Neural Network for Few-Shot Knowledge Graph Completion
Few-shot knowledge graph completion (FKGC) aims to query the unseen facts of a relation given its few-shot reference entity pairs. The side effect of noises due to the uncertainty of entities and triples may limit the few-shot learning, but existing FKGC works neglect such uncertainty, which leads them more susceptible to limited reference samples with noises. In this paper, we propose a novel uncertainty-aware few-shot KG completion framework (UFKGC) to model uncertainty for a better understanding of the limited data by learning representations under Gaussian distribution. Uncertainty representation is first designed for estimating the uncertainty scope of the entity pairs after transferring feature representations into a Gaussian distribution. Further, to better integrate the neighbors with uncertainty characteristics for entity features, we design an uncertainty-aware relational graph neural network (UR-GNN) to conduct convolution operations between the Gaussian distributions. Then, multiple random samplings are conducted for reference triples within the Gaussian distribution to generate smooth reference representations during the optimization. The final completion score for each query instance is measured by the designed uncertainty optimization to make our approach more robust to the noises in few-shot scenarios. Experimental results show that our approach achieves excellent performance on two benchmark datasets compared to its competitors.
♻ ☆ RoleInteract: Evaluating the Social Interaction of Role-Playing Agents
Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce RoleInteract, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both individual and group levels of social interactions. The benchmark is constructed from a variety of sources and covers a wide range of 500 characters and over 6,000 question prompts and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that agents excelling in individual level does not imply their proficiency in group level. Moreover, the behavior of individuals may drift as a result of the influence exerted by other agents within the group. Experimental results on RoleInteract confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/RoleInteract.
♻ ☆ Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese
In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their performance on downstream natural language understanding and generative tasks is only 3.56% poorer on NLU tasks and 1.51% on NLG tasks than LMs pre-trained on clean data. Further, we propose the use of lightweight TinyLMs pre-trained on clean data to filter synthetic data efficiently which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pretraining on a tiny fraction (10%) of clean data. We release the data we collected and created as a part of this work, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.
♻ ☆ CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning ICDE 2024
Instruction tuning is crucial for enabling Language Learning Models (LLMs) in responding to human instructions. The quality of instruction pairs used for tuning greatly affects the performance of LLMs. However, the manual creation of high-quality instruction datasets is costly, leading to the adoption of automatic generation of instruction pairs by LLMs as a popular alternative. To ensure the high quality of LLM-generated instruction datasets, several approaches have been proposed. Nevertheless, existing methods either compromise dataset integrity by filtering a large proportion of samples, or are unsuitable for industrial applications. In this paper, instead of discarding low-quality samples, we propose CoachLM, a novel approach to enhance the quality of instruction datasets through automatic revisions on samples in the dataset. CoachLM is trained from the samples revised by human experts and significantly increases the proportion of high-quality samples in the dataset from 17.7% to 78.9%. The effectiveness of CoachLM is further assessed on various real-world instruction test sets. The results show that CoachLM improves the instruction-following capabilities of the instruction-tuned LLM by an average of 29.9%, which even surpasses larger LLMs with nearly twice the number of parameters. Furthermore, CoachLM is successfully deployed in a data management system for LLMs at Huawei, resulting in an efficiency improvement of up to 20% in the cleaning of 40k real-world instruction pairs. We release various assets of CoachLM, including the training data, code and test set (https://github.com/lunyiliu/CoachLM).
comment: Accepted by ICDE 2024
♻ ☆ Arcee's MergeKit: A Toolkit for Merging Large Language Models
The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, has resulted in the development of vast amounts of task-specific models, typically specialized in individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI - including the difficulties of catastrophic forgetting and multitask learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the worlds most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.
comment: 11 pages, 4 figures
♻ ☆ Language Models Hallucinate, but May Excel at Fact Verification NAACL 2024
Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently "hallucinate," resulting in non-factual outputs. Our carefully-designed human evaluation substantiates the serious hallucination issue, revealing that even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers in order to measure and incentivize progress. Our systematic investigation affirms that LLMs can be repurposed as effective fact verifiers with strong correlations with human judgments. Surprisingly, FLAN-T5-11B, the least factual generator in our study, performs the best as a fact verifier, even outperforming more capable LLMs like GPT3.5 and ChatGPT. Delving deeper, we analyze the reliance of these LLMs on high-quality evidence, as well as their deficiencies in robustness and generalization ability. Our study presents insights for developing trustworthy generation models.
comment: Accepted in NAACL 2024
♻ ☆ From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards
Recent progress in large language models (LLMs) has led to their widespread adoption in various domains. However, these advancements have also introduced additional safety risks and raised concerns regarding their detrimental impact on already marginalized populations. Despite growing mitigation efforts to develop safety safeguards, such as supervised safety-oriented fine-tuning and leveraging safe reinforcement learning from human feedback, multiple concerns regarding the safety and ingrained biases in these models remain. Furthermore, previous work has demonstrated that models optimized for safety often display exaggerated safety behaviors, such as a tendency to refrain from responding to certain requests as a precautionary measure. As such, a clear trade-off between the helpfulness and safety of these models has been documented in the literature. In this paper, we further investigate the effectiveness of safety measures by evaluating models on already mitigated biases. Using the case of Llama 2 as an example, we illustrate how LLMs' safety responses can still encode harmful assumptions. To do so, we create a set of non-toxic prompts, which we then use to evaluate Llama models. Through our new taxonomy of LLMs responses to users, we observe that the safety/helpfulness trade-offs are more pronounced for certain demographic groups which can lead to quality-of-service harms for marginalized populations.
comment: 9 pages, 4 figures
♻ ☆ Facilitating NSFW Text Detection in Open-Domain Dialogue Systems via Knowledge Distillation
NSFW (Not Safe for Work) content, in the context of a dialogue, can have severe side effects on users in open-domain dialogue systems. However, research on detecting NSFW language, especially sexually explicit content, within a dialogue context has significantly lagged behind. To address this issue, we introduce CensorChat, a dialogue monitoring dataset aimed at NSFW dialogue detection. Leveraging knowledge distillation techniques involving GPT-4 and ChatGPT, this dataset offers a cost-effective means of constructing NSFW content detectors. The process entails collecting real-life human-machine interaction data and breaking it down into single utterances and single-turn dialogues, with the chatbot delivering the final utterance. ChatGPT is employed to annotate unlabeled data, serving as a training set. Rationale validation and test sets are constructed using ChatGPT and GPT-4 as annotators, with a self-criticism strategy for resolving discrepancies in labeling. A BERT model is fine-tuned as a text classifier on pseudo-labeled data, and its performance is assessed. The study emphasizes the importance of AI systems prioritizing user safety and well-being in digital conversations while respecting freedom of expression. The proposed approach not only advances NSFW content detection but also aligns with evolving user protection needs in AI-driven dialogues.
comment: As we have submitted a final version arXiv:2403.13250, we decide to withdraw it
♻ ☆ ChatGPT4PCG Competition: Character-like Level Generation for Science Birds
This paper presents the first ChatGPT4PCG Competition at the 2023 IEEE Conference on Games. The objective of this competition is for participants to create effective prompts for ChatGPT--enabling it to generate Science Birds levels with high stability and character-like qualities--fully using their creativity as well as prompt engineering skills. ChatGPT is a conversational agent developed by OpenAI. Science Birds is selected as the competition platform because designing an Angry Birds-like level is not a trivial task due to the in-game gravity; the quality of the levels is determined by their stability. To lower the entry barrier to the competition, we limit the task to the generation of capitalized English alphabetical characters. We also allow only a single prompt to be used for generating all the characters. Here, the quality of the generated levels is determined by their stability and similarity to the given characters. A sample prompt is provided to participants for their reference. An experiment is conducted to determine the effectiveness of several modified versions of this sample prompt on level stability and similarity by testing them on several characters. To the best of our knowledge, we believe that ChatGPT4PCG is the first competition of its kind and hope to inspire enthusiasm for prompt engineering in procedural content generation.
comment: This paper accepted for presentation at IEEE CoG 2023 is made available for participants of ChatGPT4PCG Competition (https://chatgpt4pcg.github.io/) and readers interested in relevant areas. In this PDF version, the affiliation symbol of Julian Togelius has been revised
♻ ☆ ComCLIP: Training-Free Compositional Image and Text Matching
Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-lanaguage pretrained models like CLIP to compositional image and text matching -- a more challenging image and text matching task requiring the model understanding of compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper, we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause the matching failure. Therefore, we propose a novel \textbf{\textit{training-free}} compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subjects, objects, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embedding and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets: SVO, ComVG, Winoground, and VL-checklist, and two general image-text retrieval datasets: Flick30K, and MSCOCO demonstrate the effectiveness of our plug-and-play method, which boosts the \textbf{\textit{zero-shot}} inference ability of CLIP, SLIP, and BLIP2 even without further training or fine-tuning. Our codes can be found at https://github.com/eric-ai-lab/ComCLIP.
♻ ☆ Self-Improving for Zero-Shot Named Entity Recognition with Large Language Models NAACL 2024
Exploring the application of powerful large language models (LLMs) on the named entity recognition (NER) task has drawn much attention recently. This work pushes the performance boundary of zero-shot NER with LLMs by proposing a training-free self-improving framework, which utilizes an unlabeled corpus to stimulate the self-learning ability of LLMs. First, we use the LLM to make predictions on the unlabeled corpus using self-consistency and obtain a self-annotated dataset. Second, we explore various strategies to select reliable annotations to form a reliable self-annotated dataset. Finally, for each test input, we retrieve demonstrations from the reliable self-annotated dataset and perform inference via in-context learning. Experiments on four benchmarks show substantial performance improvements achieved by our framework. Through comprehensive experimental analysis, we find that increasing the size of unlabeled corpus or iterations of self-improving does not guarantee further improvement, but the performance might be boosted via more advanced strategies for reliable annotation selection. Code and data are publicly available at https://github.com/Emma1066/Self-Improve-Zero-Shot-NER
comment: Accepted to NAACL 2024 (Main Conference)
♻ ☆ VQPy: An Object-Oriented Approach to Modern Video Analytics
Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., human, animals, cars, etc.), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend$\unicode{x2015}$a Python variant with constructs that make it easy for users to express video objects and their interactions$\unicode{x2015}$as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
♻ ☆ CryCeleb: A Speaker Verification Dataset Based on Infant Cry Sounds ICASSP 2024
This paper describes the Ubenwa CryCeleb dataset - a labeled collection of infant cries - and the accompanying CryCeleb 2023 task, which is a public speaker verification challenge based on cry sounds. We released more than 6 hours of manually segmented cry sounds from 786 newborns for academic use, aiming to encourage research in infant cry analysis. The inaugural public competition attracted 59 participants, 11 of whom improved the baseline performance. The top-performing system achieved a significant improvement scoring 25.8% equal error rate, which is still far from the performance of state-of-the-art adult speaker verification systems. Therefore, we believe there is room for further research on this dataset, potentially extending beyond the verification task.
comment: ICASSP 2024
♻ ☆ Do Large Language Models understand Medical Codes?
The overarching goal of recent AI research has been to make steady progress towards achieving Artificial General Intelligence (AGI), prompting the evaluation of Large Language Models (LLMs) across a variety of tasks and domains. One such domain is healthcare, where LLMs can greatly benefit clinical practice by assisting with a wide range of tasks. However, these models are also prone to producing ``hallucinations" or incorrect responses when faced with queries they cannot adequately address, raising concerns and skepticism, especially within the healthcare community. In this work, we investigate whether LLMs understand and can predict medical codes, which are extensively utilized in healthcare practice. This study aims to delineate the capabilities and limitations of these LLMs. We evaluate various off-the-shelf LLMs (e.g., GPT, LLaMA, etc.) and LLMs specifically designed for biomedical applications to assess their awareness and understanding of these domain-specific terminologies. Our results indicate that these models as they currently stand do not comprehend the meaning of the medical codes, highlighting the need for better representation of these alphanumeric codes extensively used in healthcare. We call for improved strategies to effectively capture and represent the nuances of medical codes and terminologies within LLMs, enabling them to become more reliable and trustworthy tools for healthcare professionals.
♻ ☆ MacGyver: Are Large Language Models Creative Problem Solvers? NAACL 2024
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with domain-specific knowledge, leading to a higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt broader problems but fail by proposing physically-infeasible actions. Finally, we provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking. This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI.
comment: NAACL 2024
♻ ☆ Tur[k]ingBench: A Challenge Benchmark for Web Agents
Recent chatbots have demonstrated impressive ability to understand and communicate in raw-text form. However, there is more to the world than raw text. For example, humans spend long hours of their time on web pages, where text is intertwined with other modalities and tasks are accomplished in the form of various complex interactions. Can state-of-the-art multi-modal models generalize to such complex domains? To address this question, we introduce TurkingBench, a benchmark of tasks formulated as web pages containing textual instructions with multi-modal context. Unlike existing work which employs artificially synthesized web pages, here we use natural HTML pages that were originally designed for crowdsourcing workers for various annotation purposes. The HTML instructions of each task are also instantiated with various values (obtained from the crowdsourcing tasks) to form new instances of the task. This benchmark contains 32.2K instances distributed across 158 tasks. Additionally, to facilitate the evaluation on TurkingBench, we develop an evaluation framework that connects the responses of chatbots to modifications on web pages (modifying a text box, checking a radio, etc.). We evaluate the performance of state-of-the-art models, including language-only, vision-only, and layout-only models, and their combinations, on this benchmark. Our findings reveal that these models perform significantly better than random chance, yet considerable room exists for improvement. We hope this benchmark will help facilitate the evaluation and development of web-based agents.
♻ ☆ Incentivizing News Consumption on Social Media Platforms Using Large Language Models and Realistic Bot Accounts
Polarization, declining trust, and wavering support for democratic norms are pressing threats to U.S. democracy. Exposure to verified and quality news may lower individual susceptibility to these threats and make citizens more resilient to misinformation, populism, and hyperpartisan rhetoric. This project examines how to enhance users' exposure to and engagement with verified and ideologically balanced news in an ecologically valid setting. We rely on a large-scale two-week long field experiment (from 1/19/2023 to 2/3/2023) on 28,457 Twitter users. We created 28 bots utilizing GPT-2 that replied to users tweeting about sports, entertainment, or lifestyle with a contextual reply containing two hardcoded elements: a URL to the topic-relevant section of quality news organization and an encouragement to follow its Twitter account. To further test differential effects by gender of the bots, treated users were randomly assigned to receive responses by bots presented as female or male. We examine whether our over-time intervention enhances the following of news media organization, the sharing and the liking of news content and the tweeting about politics and the liking of political content. We find that the treated users followed more news accounts and the users in the female bot treatment were more likely to like news content than the control. Most of these results, however, were small in magnitude and confined to the already politically interested Twitter users, as indicated by their pre-treatment tweeting about politics. These findings have implications for social media and news organizations, and also offer direction for future work on how Large Language Models and other computational interventions can effectively enhance individual on-platform engagement with quality news and public affairs.
♻ ☆ BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat's safeguards and weaponize Llama 2's capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than $200, while retaining its general capabilities. Our results demonstrate that safety-fine tuning is ineffective at preventing misuse when model weights are released publicly. Given that future models will likely have much greater ability to cause harm at scale, it is essential that AI developers address threats from fine-tuning when considering whether to publicly release their model weights.
Software Engineering
☆ Envisioning the Next-Generation AI Coding Assistants: Insights & Proposals
As a research-product hybrid group in AI for Software Engineering (AI4SE), we present four key takeaways from our experience developing in-IDE AI coding assistants. AI coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. We propose open questions and challenges that academia and industry should address to realize the vision of next-generation AI coding assistants.
☆ Towards Single-System Illusion in Software-Defined Vehicles -- Automated, AI-Powered Workflow
We propose a novel model- and feature-based approach to development of vehicle software systems, where the end architecture is not explicitly defined. Instead, it emerges from an iterative process of search and optimization given certain constraints, requirements and hardware architecture, while retaining the property of single-system illusion, where applications run in a logically uniform environment. One of the key points of the presented approach is the inclusion of modern generative AI, specifically Large Language Models (LLMs), in the loop. With the recent advances in the field, we expect that the LLMs will be able to assist in processing of requirements, generation of formal system models, as well as generation of software deployment specification and test code. The resulting pipeline is automated to a large extent, with feedback being generated at each step.
☆ DomainLab: A modular Python package for domain generalization in deep learning
Poor generalization performance caused by distribution shifts in unseen domains often hinders the trustworthy deployment of deep neural networks. Many domain generalization techniques address this problem by adding a domain invariant regularization loss terms during training. However, there is a lack of modular software that allows users to combine the advantages of different methods with minimal effort for reproducibility. DomainLab is a modular Python package for training user specified neural networks with composable regularization loss terms. Its decoupled design allows the separation of neural networks from regularization loss construction. Hierarchical combinations of neural networks, different domain generalization methods, and associated hyperparameters, can all be specified together with other experimental setup in a single configuration file. Hierarchical combinations of neural networks, different domain generalization methods, and associated hyperparameters, can all be specified together with other experimental setup in a single configuration file. In addition, DomainLab offers powerful benchmarking functionality to evaluate the generalization performance of neural networks in out-of-distribution data. The package supports running the specified benchmark on an HPC cluster or on a standalone machine. The package is well tested with over 95 percent coverage and well documented. From the user perspective, it is closed to modification but open to extension. The package is under the MIT license, and its source code, tutorial and documentation can be found at https://github.com/marrlab/DomainLab.
☆ Multi-role Consensus through LLMs Discussions for Vulnerability Detection
Recent advancements in large language models (LLMs) have highlighted the potential for vulnerability detection, a crucial component of software quality assurance. Despite this progress, most studies have been limited to the perspective of a single role, usually testers, lacking diverse viewpoints from different roles in a typical software development life-cycle, including both developers and testers. To this end, this paper introduces an approach to employ LLMs to act as different roles to simulate real-life code review process, engaging in discussions towards a consensus on the existence and classification of vulnerabilities in the code. Preliminary evaluation of the proposed approach indicates a 4.73% increase in the precision rate, 58.9% increase in the recall rate, and a 28.1% increase in the F1 score.
☆ A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond
Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize code -- holds immense potential for transformative impacts on the whole society. Bridging the gap between Natural Language and Programming Language, this domain has drawn significant attention from researchers in both research communities over the past few years. This survey presents a systematic and chronological review of the advancements in code intelligence, encompassing over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. We follow the historical progression to trace the paradigm shifts across different research phases (e.g., from modeling code with recurrent neural networks to the era of Large Language Models). Concurrently, we highlight the major technical transitions in models, tasks, and evaluations spanning through different stages. For applications, we also observe a co-evolving shift. It spans from initial endeavors to tackling specific scenarios, through exploring a diverse array of tasks during its rapid expansion, to currently focusing on tackling increasingly complex and varied real-world challenges. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains. Finally, we delve into both the opportunities and challenges associated with this field, alongside elucidating our insights on the most promising research directions. An ongoing, dynamically updated project and resources associated with this survey have been released at https://github.com/QiushiSun/NCISurvey.
comment: 64 pages, 6 figures, 10 tables, 688 references
☆ Navigating Fairness: Practitioners' Understanding, Challenges, and Strategies in AI/ML Development
The rise in the use of AI/ML applications across industries has sparked more discussions about the fairness of AI/ML in recent times. While prior research on the fairness of AI/ML exists, there is a lack of empirical studies focused on understanding the views and experiences of AI practitioners in developing a fair AI/ML. Understanding AI practitioners' views and experiences on the fairness of AI/ML is important because they are directly involved in its development and deployment and their insights can offer valuable real-world perspectives on the challenges associated with ensuring fairness in AI/ML. We conducted semi-structured interviews with 22 AI practitioners to investigate their understanding of what a 'fair AI/ML' is, the challenges they face in developing a fair AI/ML, the consequences of developing an unfair AI/ML, and the strategies they employ to ensure AI/ML fairness. We developed a framework showcasing the relationship between AI practitioners' understanding of 'fair AI/ML' and (i) their challenges in its development, (ii) the consequences of developing an unfair AI/ML, and (iii) strategies used to ensure AI/ML fairness. Additionally, we also identify areas for further investigation and offer recommendations to aid AI practitioners and AI companies in navigating fairness.
comment: 31 pages, 8 figures, 2 tables
♻ ☆ Collaborative Distributed Machine Learning
Various collaborative distributed machine learning (CDML) systems, including federated learning systems and swarm learning systems, with different key traits were developed to leverage resources for development and use of machine learning (ML) models in a confidentiality-preserving way. To meet use case requirements, suitable CDML systems need to be selected. However, comparison between CDML systems regarding their suitability for use cases is often difficult. This work presents a CDML system conceptualization and CDML archetypes to support comparison of CDML systems and introduce scientific and practical audiences to the principal functioning and key traits of CDML systems.
♻ ☆ Common Challenges of Deep Reinforcement Learning Applications Development: An Empirical Study
Machine Learning (ML) is increasingly being adopted in different industries. Deep Reinforcement Learning (DRL) is a subdomain of ML used to produce intelligent agents. Despite recent developments in DRL technology, the main challenges that developers face in the development of DRL applications are still unknown. To fill this gap, in this paper, we conduct a large-scale empirical study of 927 DRL-related posts extracted from Stack Overflow, the most popular Q&A platform in the software community. Through the process of labeling and categorizing extracted posts, we created a taxonomy of common challenges encountered in the development of DRL applications, along with their corresponding popularity levels. This taxonomy has been validated through a survey involving 59 DRL developers. Results show that at least 45% of developers experienced 18 of the 21 challenges identified in the taxonomy. The most frequent source of difficulty during the development of DRL applications are Comprehension, API usage, and Design problems, while Parallel processing, and DRL libraries/frameworks are classified as the most difficult challenges to address, with respect to the time required to receive an accepted answer. We hope that the research community will leverage this taxonomy to develop efficient strategies to address the identified challenges and improve the quality of DRL applications.
comment: Submitted to Empirical Software Engineering journal
♻ ☆ Toward a Theory of Causation for Interpreting Neural Code Models
Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post hoc interpretability method specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and ten NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax. All our NCMs, except for the BERT-like model, statistically learn to predict tokens related to blocks of code (\eg brackets, parenthesis, semicolon) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful method to detect and facilitate the elimination of confounding bias in NCMs.
♻ ☆ Pushing the Limits: Concurrency Detection in Acyclic Sound Free-Choice Workflow Nets in $O(P^2 + T^2)$
Concurrency is an important aspect of Petri nets to describe and simulate the behavior of complex systems. Knowing which places and transitions could be executed in parallel helps to understand nets and enables analysis techniques and the computation of other properties, such as causality, exclusivity, etc.. All techniques based on concurrency detection depend on the efficiency of this detection methodology. Kovalyov and Esparza have developed algorithms that compute all concurrent places in $O\big((P+T)TP^2\big)$ for live and bounded nets (where $P$ and $T$ are the numbers of places and transitions) and in $O\big(P(P+T)^2\big)$ for live and bounded free-choice nets. Although these algorithms have a reasonably good computational complexity, large numbers of concurrent pairs of nodes may still lead to long computation times. This paper complements the palette of concurrency detection algorithms with the Concurrent Paths (CP) algorithm for sound free-choice workflow nets. The algorithm allows parallelization and has a worst-case computational complexity of $O(P^2 + T^2)$ for acyclic nets and of $O(P^3 + PT^2)$ for cyclic nets. Although the computational complexity of cyclic nets has not improved, the evaluation shows the benefits of CP, especially, if the net contains many nodes in concurrency relation.
comment: 19 pages, 14 figures, 5 algorithms
♻ ☆ AlloyASG: Alloy Predicate Code Representation as a Compact Structurally Balanced Graph
In the program analysis and automated bug-fixing fields, it is common to create an abstract interpretation of a program's source code as an Abstract Syntax Tree (AST), which enables programs written in a high-level language to have various static and dynamic analyses applied. However, ASTs suffer from exponential growth in their data size due to the limitation that ASTs will often have identical nodes separately listed in the tree. To address this issue, we introduce a novel code representation schema, Complex Structurally Balanced Abstract Semantic Graph (CSBASG), which represents code as a complex-weighted directed graph that lists a semantic element as a node in the graph and ensures its structural balance for almost finitely enumerable code segments, such as the modeling language Alloy. Our experiment ensures that CSBASG provides a one-on-one correspondence of Alloy predicates to complex-weighted graphs. We evaluate the effectiveness and efficiency of our CSBASG representation for Alloy models and identify future applications of CSBASG for Alloy code generation and automated repair.
comment: 11 pages
♻ ☆ MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing ISSTA24
Microservice resilience, the ability of microservices to recover from failures and continue providing reliable and responsive services, is crucial for cloud vendors. However, the current practice relies on manually configured rules specific to a certain microservice system, resulting in labor-intensity and flexibility issues, given the large scale and high dynamics of microservices. A more labor-efficient and versatile solution is desired. Our insight is that resilient deployment can effectively prevent the dissemination of degradation from system performance metrics to user-aware metrics, and the latter directly affects service quality. In other words, failures in a non-resilient deployment can impact both types of metrics, leading to user dissatisfaction. With this in mind, we propose MicroRes, the first versatile resilience profiling framework for microservices via degradation dissemination indexing. MicroRes first injects failures into microservices and collects available monitoring metrics. Then, it ranks the metrics according to their contributions to the overall service degradation. It produces a resilience index by how much the degradation is disseminated from system performance metrics to user-aware metrics. Higher degradation dissemination indicates lower resilience. We evaluate MicroRes on two open-source and one industrial microservice system. The experiments show MicroRes' efficient and effective resilience profiling of microservices. We also showcase MicroRes' practical usage in production.
comment: In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA24)
♻ ☆ The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI
Generative AI (GAI) offers unprecedented possibilities but its commercialization has raised concerns about transparency, reproducibility, bias, and safety. Many "open-source" GAI models lack the necessary components for full understanding and reproduction, and some use restrictive licenses, a practice known as "openwashing." We propose the Model Openness Framework (MOF), a ranked classification system that rates machine learning models based on their completeness and openness, following principles of open science, open source, open data, and open access. The MOF requires specific components of the model development lifecycle to be included and released under appropriate open licenses. This framework aims to prevent misrepresentation of models claiming to be open, guide researchers and developers in providing all model components under permissive licenses, and help companies, academia, and hobbyists identify models that can be safely adopted without restrictions. Wide adoption of the MOF will foster a more open AI ecosystem, accelerating research, innovation, and adoption.
comment: 45 pages
Programming Languages
☆ Lean4Lean: Towards a formalized metatheory for the Lean theorem prover
In this paper we present a new "external verifier" for the Lean theorem prover, written in Lean itself. This is the first complete verifier for Lean 4 other than the reference implementation in C++ used by Lean itself, and our new verifier is competitive with the original, running between 20% and 50% slower and usable to verify all of Lean's mathlib library, forming an additional step in Lean's aim to self-host the full elaborator and compiler. Moreover, because the verifier is written in a language which admits formal verification, it is possible to state and prove properties about the kernel itself, and we report on some initial steps taken in this direction to formalize the Lean type theory abstractly and show that the kernel correctly implements this theory, to eliminate the possibility of implementation bugs in the kernel and increase the trustworthiness of proofs conducted in it. This work is still ongoing but we plan to use this project to help justify any future changes to the kernel and type theory and ensure unsoundness does not sneak in through either the abstract theory or implementation bugs.
comment: 17 pages, submitted to ITP 2024
☆ The Elements of Differentiable Programming
Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of differentiable programming. This new programming paradigm enables end-to-end differentiation of complex computer programs (including those with control flows and data structures), making gradient-based optimization of program parameters possible. As an emerging paradigm, differentiable programming builds upon several areas of computer science and applied mathematics, including automatic differentiation, graphical models, optimization and statistics. This book presents a comprehensive review of the fundamental concepts useful for differentiable programming. We adopt two main perspectives, that of optimization and that of probability, with clear analogies between the two. Differentiable programming is not merely the differentiation of programs, but also the thoughtful design of programs intended for differentiation. By making programs differentiable, we inherently introduce probability distributions over their execution, providing a means to quantify the uncertainty associated with program outputs.
comment: Draft version 1
☆ A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond
Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize code -- holds immense potential for transformative impacts on the whole society. Bridging the gap between Natural Language and Programming Language, this domain has drawn significant attention from researchers in both research communities over the past few years. This survey presents a systematic and chronological review of the advancements in code intelligence, encompassing over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. We follow the historical progression to trace the paradigm shifts across different research phases (e.g., from modeling code with recurrent neural networks to the era of Large Language Models). Concurrently, we highlight the major technical transitions in models, tasks, and evaluations spanning through different stages. For applications, we also observe a co-evolving shift. It spans from initial endeavors to tackling specific scenarios, through exploring a diverse array of tasks during its rapid expansion, to currently focusing on tackling increasingly complex and varied real-world challenges. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains. Finally, we delve into both the opportunities and challenges associated with this field, alongside elucidating our insights on the most promising research directions. An ongoing, dynamically updated project and resources associated with this survey have been released at https://github.com/QiushiSun/NCISurvey.
comment: 64 pages, 6 figures, 10 tables, 688 references
♻ ☆ Combining Type Checking and Set Constraint Solving to Improve Automated Software Verification
In this paper we show how prescritive type checking and constraint solving can be combined to increase automation during software verification. We do so by defining a type system and implementing a typechecker for {log} (read `setlog'), a Constraint Logic Programming (CLP) language and satisfiability solver based on set theory. Hence, we proceed as follows: a) a type system for {log} is defined; b) the constraint solver is proved to be safe w.r.t. the type system; c) the implementation of a concrete typechecker is presented; d) the integration of type checking and set constraint solving to increase automation during software verification is discussed; and f) two industrial-strength case studies are presented where this combination is used with very good results. Under consideration in Theory and Practice of Logic Programming (TPLP)
comment: Under consideration in Theory and Practice of Logic Programming (TPLP)
♻ ☆ Formalizing Stack Safety as a Security Property
The term stack safety is used to describe a variety of compiler, run-time, and hardware mechanisms for protecting stack memory. Unlike "the heap," the ISA-level stack does not correspond to a single high-level language concept: different compilers use it in different ways to support procedural and functional abstraction mechanisms from a wide range of languages. This protean nature makes it difficult to nail down what it means to correctly enforce stack safety. We propose a new formal characterization of stack safety using concepts from language-based security. Rather than treating stack safety as a monolithic property, we decompose it into an integrity property and a confidentiality property for each of the caller and the callee, plus a control-flow property: five properties in all. This formulation is motivated by a particular class of enforcement mechanisms, the "lazy" stack safety micro-policies studied by Roessler and DeHon, which permit functions to write into one another's frames but taint the changed locations so that the frame's owner cannot access them. No existing characterization of stack safety captures this style of safety; we capture it here by stating our properties in terms of the observable behavior of the system. Our properties go further than previous formal definitions of stack safety, supporting caller- and callee-saved registers, arguments passed on the stack, and tail-call elimination. We validate the properties by using them to distinguish between correct and incorrect implementations of Roessler and DeHon's micro-policies using property-based random testing. Our test harness successfully identifies several broken variants, including Roessler and DeHon's lazy policy; a repaired version of their policy passes our tests.
♻ ☆ AlloyASG: Alloy Predicate Code Representation as a Compact Structurally Balanced Graph
In the program analysis and automated bug-fixing fields, it is common to create an abstract interpretation of a program's source code as an Abstract Syntax Tree (AST), which enables programs written in a high-level language to have various static and dynamic analyses applied. However, ASTs suffer from exponential growth in their data size due to the limitation that ASTs will often have identical nodes separately listed in the tree. To address this issue, we introduce a novel code representation schema, Complex Structurally Balanced Abstract Semantic Graph (CSBASG), which represents code as a complex-weighted directed graph that lists a semantic element as a node in the graph and ensures its structural balance for almost finitely enumerable code segments, such as the modeling language Alloy. Our experiment ensures that CSBASG provides a one-on-one correspondence of Alloy predicates to complex-weighted graphs. We evaluate the effectiveness and efficiency of our CSBASG representation for Alloy models and identify future applications of CSBASG for Alloy code generation and automated repair.
comment: 11 pages
Computer Vision and Pattern Recognition
☆ Zero-Shot Multi-Object Shape Completion
We present a 3D shape completion method that recovers the complete geometry of multiple objects in complex scenes from a single RGB-D image. Despite notable advancements in single object 3D shape completion, high-quality reconstructions in highly cluttered real-world multi-object scenes remains a challenge. To address this issue, we propose OctMAE, an architecture that leverages an Octree U-Net and a latent 3D MAE to achieve high-quality and near real-time multi-object shape completion through both local and global geometric reasoning. Because a na\"ive 3D MAE can be computationally intractable and memory intensive even in the latent space, we introduce a novel occlusion masking strategy and adopt 3D rotary embeddings, which significantly improves the runtime and shape completion quality. To generalize to a wide range of objects in diverse scenes, we create a large-scale photorealistic dataset, featuring a diverse set of 12K 3D object models from the Objaverse dataset which are rendered in multi-object scenes with physics-based positioning. Our method outperforms the current state-of-the-art on both synthetic and real-world datasets and demonstrates a strong zero-shot capability.
comment: 21 pages, 8 figues
☆ MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
We propose MVSplat, an efficient feed-forward 3D Gaussian Splatting model learned from sparse multi-view images. To accurately localize the Gaussian centers, we propose to build a cost volume representation via plane sweeping in the 3D space, where the cross-view feature similarities stored in the cost volume can provide valuable geometry cues to the estimation of depth. We learn the Gaussian primitives' opacities, covariances, and spherical harmonics coefficients jointly with the Gaussian centers while only relying on photometric supervision. We demonstrate the importance of the cost volume representation in learning feed-forward Gaussian Splatting models via extensive experimental evaluations. On the large-scale RealEstate10K and ACID benchmarks, our model achieves state-of-the-art performance with the fastest feed-forward inference speed (22 fps). Compared to the latest state-of-the-art method pixelSplat, our model uses $10\times $ fewer parameters and infers more than $2\times$ faster while providing higher appearance and geometry quality as well as better cross-dataset generalization.
comment: Project page: https://donydchen.github.io/mvsplat Code: https://github.com/donydchen/mvsplat
☆ LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors
We present a simple self-supervised method to enhance the performance of ViT features for dense downstream tasks. Our Lightweight Feature Transform (LiFT) is a straightforward and compact postprocessing network that can be applied to enhance the features of any pre-trained ViT backbone. LiFT is fast and easy to train with a self-supervised objective, and it boosts the density of ViT features for minimal extra inference cost. Furthermore, we demonstrate that LiFT can be applied with approaches that use additional task-specific downstream modules, as we integrate LiFT with ViTDet for COCO detection and segmentation. Despite the simplicity of LiFT, we find that it is not simply learning a more complex version of bilinear interpolation. Instead, our LiFT training protocol leads to several desirable emergent properties that benefit ViT features in dense downstream tasks. This includes greater scale invariance for features, and better object boundary maps. By simply training LiFT for a few epochs, we show improved performance on keypoint correspondence, detection, segmentation, and object discovery tasks. Overall, LiFT provides an easy way to unlock the benefits of denser feature arrays for a fraction of the computational cost. For more details, refer to our project page at https://www.cs.umd.edu/~sakshams/LiFT/.
☆ ODTFormer: Efficient Obstacle Detection and Tracking with Stereo Cameras Based on Transformer
Obstacle detection and tracking represent a critical component in robot autonomous navigation. In this paper, we propose ODTFormer, a Transformer-based model to address both obstacle detection and tracking problems. For the detection task, our approach leverages deformable attention to construct a 3D cost volume, which is decoded progressively in the form of voxel occupancy grids. We further track the obstacles by matching the voxels between consecutive frames. The entire model can be optimized in an end-to-end manner. Through extensive experiments on DrivingStereo and KITTI benchmarks, our model achieves state-of-the-art performance in the obstacle detection task. We also report comparable accuracy to state-of-the-art obstacle tracking models while requiring only a fraction of their computation cost, typically ten-fold to twenty-fold less. The code and model weights will be publicly released.
comment: 8 pages
☆ MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io
comment: 46 Pages, Work in Progress, Benchmark Project Page: https://mathverse-cuhk.github.io
☆ Simplified Diffusion Schrödinger Bridge
This paper introduces a novel theoretical simplification of the Diffusion Schr\"odinger Bridge (DSB) that facilitates its unification with Score-based Generative Models (SGMs), addressing the limitations of DSB in complex data generation and enabling faster convergence and enhanced performance. By employing SGMs as an initial solution for DSB, our approach capitalizes on the strengths of both frameworks, ensuring a more efficient training process and improving the performance of SGM. We also propose a reparameterization technique that, despite theoretical approximations, practically improves the network's fitting capabilities. Our extensive experimental evaluations confirm the effectiveness of the simplified DSB, demonstrating its significant improvements. We believe the contributions of this work pave the way for advanced generative modeling. The code is available at https://github.com/tzco/Simplified-Diffusion-Schrodinger-Bridge.
☆ Language Repository for Long Video Understanding
Language has become a prominent modality in computer vision with the rise of multi-modal LLMs. Despite supporting long context-lengths, their effectiveness in handling long-term information gradually declines with input length. This becomes critical, especially in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs, that maintains concise and structured information as an interpretable (i.e., all-textual) representation. Our repository is updated iteratively based on multi-scale video chunks. We introduce write and read operations that focus on pruning redundancies in text, and extracting information at various temporal scales. The proposed framework is evaluated on zero-shot visual question-answering benchmarks including EgoSchema, NExT-QA, IntentQA and NExT-GQA, showing state-of-the-art performance at its scale. Our code is available at https://github.com/kkahatapitiya/LangRepo.
☆ GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation
We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s. GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information to translate the input pixels into pixel-aligned Gaussians, which are unprojected to create a set of densely distributed 3D Gaussians representing a scene. Together, our transformer architecture and the use of 3D Gaussians unlock a scalable and efficient reconstruction framework. Extensive experimental results demonstrate the superiority of our method over alternatives regarding both reconstruction quality and efficiency. We also showcase the potential of GRM in generative tasks, i.e., text-to-3D and image-to-3D, by integrating it with existing multi-view diffusion models. Our project website is at: https://justimyhxu.github.io/projects/grm/.
comment: Project page: https://justimyhxu.github.io/projects/grm/ Code: https://github.com/justimyhxu/GRM
☆ ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition
3D decomposition/segmentation still remains a challenge as large-scale 3D annotated data is not readily available. Contemporary approaches typically leverage 2D machine-generated segments, integrating them for 3D consistency. While the majority of these methods are based on NeRFs, they face a potential weakness that the instance/semantic embedding features derive from independent MLPs, thus preventing the segmentation network from learning the geometric details of the objects directly through radiance and density. In this paper, we propose ClusteringSDF, a novel approach to achieve both segmentation and reconstruction in 3D via the neural implicit surface representation, specifically Signal Distance Function (SDF), where the segmentation rendering is directly integrated with the volume rendering of neural implicit surfaces. Although based on ObjectSDF++, ClusteringSDF no longer requires the ground-truth segments for supervision while maintaining the capability of reconstructing individual object surfaces, but purely with the noisy and inconsistent labels from pre-trained models.As the core of ClusteringSDF, we introduce a high-efficient clustering mechanism for lifting the 2D labels to 3D and the experimental results on the challenging scenes from ScanNet and Replica datasets show that ClusteringSDF can achieve competitive performance compared against the state-of-the-art with significantly reduced training time.
comment: Project Page: https://sm0kywu.github.io/ClusteringSDF/
☆ Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc. with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics.
☆ Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning
Self-supervised representation learning has been highly promising for histopathology image analysis with numerous approaches leveraging their patient-slide-patch hierarchy to learn better representations. In this paper, we explore how the combination of domain specific natural language information with such hierarchical visual representations can benefit rich representation learning for medical image tasks. Building on automated language description generation for features visible in histopathology images, we present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images. We explore contrastive objectives and granular language description based text alignment at multiple hierarchies to inject language modality information into the visual representations. Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets. Our framework also provides better interpretability with our language aligned representation space. Code is available at https://github.com/Hasindri/HLSS.
comment: 13 pages and 5 figures
☆ AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation
In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To recover clean images from degraded versions, numerous specialized restoration methods have been developed, each targeting a specific type of degradation. Recently, all-in-one algorithms have garnered significant attention by addressing different types of degradations within a single model without requiring prior information of the input degradation type. However, these methods purely operate in the spatial domain and do not delve into the distinct frequency variations inherent to different degradation types. To address this gap, we propose an adaptive all-in-one image restoration network based on frequency mining and modulation. Our approach is motivated by the observation that different degradation types impact the image content on different frequency subbands, thereby requiring different treatments for each restoration task. Specifically, we first mine low- and high-frequency information from the input features, guided by the adaptively decoupled spectra of the degraded image. The extracted features are then modulated by a bidirectional operator to facilitate interactions between different frequency components. Finally, the modulated features are merged into the original input for a progressively guided restoration. With this approach, the model achieves adaptive reconstruction by accentuating the informative frequency subbands according to different input degradations. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on different image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement. Our code is available at https://github.com/c-yn/AdaIR.
comment: 28 pages,15 figures
☆ DreamReward: Text-to-3D Generation with Human Preference
3D content creation from text prompts has shown remarkable success recently. However, current text-to-3D methods often generate 3D results that do not align well with human preferences. In this paper, we present a comprehensive framework, coined DreamReward, to learn and improve text-to-3D models from human preference feedback. To begin with, we collect 25k expert comparisons based on a systematic annotation pipeline including rating and ranking. Then, we build Reward3D -- the first general-purpose text-to-3D human preference reward model to effectively encode human preferences. Building upon the 3D reward model, we finally perform theoretical analysis and present the Reward3D Feedback Learning (DreamFL), a direct tuning algorithm to optimize the multi-view diffusion models with a redefined scorer. Grounded by theoretical proof and extensive experiment comparisons, our DreamReward successfully generates high-fidelity and 3D consistent results with significant boosts in prompt alignment with human intention. Our results demonstrate the great potential for learning from human feedback to improve text-to-3D models.
comment: Project page: https://jamesyjl.github.io/DreamReward
☆ Explorative Inbetweening of Time and Space
We introduce bounded generation as a generalized task to control video generation to synthesize arbitrary camera and subject motion based only on a given start and end frame. Our objective is to fully leverage the inherent generalization capability of an image-to-video model without additional training or fine-tuning of the original model. This is achieved through the proposed new sampling strategy, which we call Time Reversal Fusion, that fuses the temporally forward and backward denoising paths conditioned on the start and end frame, respectively. The fused path results in a video that smoothly connects the two frames, generating inbetweening of faithful subject motion, novel views of static scenes, and seamless video looping when the two bounding frames are identical. We curate a diverse evaluation dataset of image pairs and compare against the closest existing methods. We find that Time Reversal Fusion outperforms related work on all subtasks, exhibiting the ability to generate complex motions and 3D-consistent views guided by bounded frames. See project page at https://time-reversal.github.io.
comment: project page at https://time-reversal.github.io
☆ T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
We present T-Rex2, a highly practical model for open-set object detection. Previous open-set object detection methods relying on text prompts effectively encapsulate the abstract concept of common objects, but struggle with rare or complex object representation due to data scarcity and descriptive limitations. Conversely, visual prompts excel in depicting novel objects through concrete visual examples, but fall short in conveying the abstract concept of objects as effectively as text prompts. Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning. T-Rex2 accepts inputs in diverse formats, including text prompts, visual prompts, and the combination of both, so that it can handle different scenarios by switching between the two prompt modalities. Comprehensive experiments demonstrate that T-Rex2 exhibits remarkable zero-shot object detection capabilities across a wide spectrum of scenarios. We show that text prompts and visual prompts can benefit from each other within the synergy, which is essential to cover massive and complicated real-world scenarios and pave the way towards generic object detection. Model API is now available at \url{https://github.com/IDEA-Research/T-Rex}.
comment: Technical Report
☆ ReNoise: Real Image Inversion Through Iterative Noising
Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities. However, applying these methods to real images necessitates the inversion of the images into the domain of the pretrained diffusion model. Achieving faithful inversion remains a challenge, particularly for more recent models trained to generate images with a small number of denoising steps. In this work, we introduce an inversion method with a high quality-to-operation ratio, enhancing reconstruction accuracy without increasing the number of operations. Building on reversing the diffusion sampling process, our method employs an iterative renoising mechanism at each inversion sampling step. This mechanism refines the approximation of a predicted point along the forward diffusion trajectory, by iteratively applying the pretrained diffusion model, and averaging these predictions. We evaluate the performance of our ReNoise technique using various sampling algorithms and models, including recent accelerated diffusion models. Through comprehensive evaluations and comparisons, we show its effectiveness in terms of both accuracy and speed. Furthermore, we confirm that our method preserves editability by demonstrating text-driven image editing on real images.
comment: project page at: https://garibida.github.io/ReNoise-Inversion/
☆ MyVLM: Personalizing VLMs for User-Specific Queries
Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability for personalized visual question-answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs.
comment: Project page: https://snap-research.github.io/MyVLM/
☆ PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. To overcome the limitation of the LMM being limited to textual output, PSALM incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks. This schema includes images, task instructions, conditional prompts, and mask tokens, which enable the model to generate and classify segmentation masks effectively. The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization. PSALM achieves superior results on several benchmarks, such as RefCOCO/RefCOCO+/RefCOCOg, COCO Panoptic Segmentation, and COCO-Interactive, and further exhibits zero-shot capabilities on unseen tasks, such as open-vocabulary segmentation, generalized referring expression segmentation and video object segmentation, making a significant step towards a GPT moment in computer vision. Through extensive experiments, PSALM demonstrates its potential to transform the domain of image segmentation, leveraging the robust visual understanding capabilities of LMMs as seen in natural language processing. Code and models are available at https://github.com/zamling/PSALM.
☆ VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition
Recent works on the global place recognition treat the task as a retrieval problem, where an off-the-shelf global descriptor is commonly designed in image-based and LiDAR-based modalities. However, it is non-trivial to perform accurate image-LiDAR global place recognition since extracting consistent and robust global descriptors from different domains (2D images and 3D point clouds) is challenging. To address this issue, we propose a novel Voxel-Cross-Pixel (VXP) approach, which establishes voxel and pixel correspondences in a self-supervised manner and brings them into a shared feature space. Specifically, VXP is trained in a two-stage manner that first explicitly exploits local feature correspondences and enforces similarity of global descriptors. Extensive experiments on the three benchmarks (Oxford RobotCar, ViViD++ and KITTI) demonstrate our method surpasses the state-of-the-art cross-modal retrieval by a large margin.
comment: Project page https://yunjinli.github.io/projects-vxp/
☆ Implicit Style-Content Separation using B-LoRA
Image stylization involves manipulating the visual appearance and texture (style) of an image while preserving its underlying objects, structures, and concepts (content). The separation of style and content is essential for manipulating the image's style independently from its content, ensuring a harmonious and visually pleasing result. Achieving this separation requires a deep understanding of both the visual and semantic characteristics of images, often necessitating the training of specialized models or employing heavy optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA (Low-Rank Adaptation) to implicitly separate the style and content components of a single image, facilitating various image stylization tasks. By analyzing the architecture of SDXL combined with LoRA, we find that jointly learning the LoRA weights of two specific blocks (referred to as B-LoRAs) achieves style-content separation that cannot be achieved by training each B-LoRA independently. Consolidating the training into only two blocks and separating style and content allows for significantly improving style manipulation and overcoming overfitting issues often associated with model fine-tuning. Once trained, the two B-LoRAs can be used as independent components to allow various image stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing.
☆ Visibility-Aware Keypoint Localization for 6DoF Object Pose Estimation
Localizing predefined 3D keypoints in a 2D image is an effective way to establish 3D-2D correspondences for 6DoF object pose estimation. However, unreliable localization results of invisible keypoints degrade the quality of correspondences. In this paper, we address this issue by localizing the important keypoints in terms of visibility. Since keypoint visibility information is currently missing in dataset collection process, we propose an efficient way to generate binary visibility labels from available object-level annotations, for keypoints of both asymmetric objects and symmetric objects. We further derive real-valued visibility-aware importance from binary labels based on PageRank algorithm. Taking advantage of the flexibility of our visibility-aware importance, we construct VAPO (Visibility-Aware POse estimator) by integrating the visibility-aware importance with a state-of-the-art pose estimation algorithm, along with additional positional encoding. Extensive experiments are conducted on popular pose estimation benchmarks including Linemod, Linemod-Occlusion, and YCB-V. The results show that, VAPO improves both the keypoint correspondences and final estimated poses, and clearly achieves state-of-the-art performances.
☆ Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering
We propose Gaussian Frosting, a novel mesh-based representation for high-quality rendering and editing of complex 3D effects in real-time. Our approach builds on the recent 3D Gaussian Splatting framework, which optimizes a set of 3D Gaussians to approximate a radiance field from images. We propose first extracting a base mesh from Gaussians during optimization, then building and refining an adaptive layer of Gaussians with a variable thickness around the mesh to better capture the fine details and volumetric effects near the surface, such as hair or grass. We call this layer Gaussian Frosting, as it resembles a coating of frosting on a cake. The fuzzier the material, the thicker the frosting. We also introduce a parameterization of the Gaussians to enforce them to stay inside the frosting layer and automatically adjust their parameters when deforming, rescaling, editing or animating the mesh. Our representation allows for efficient rendering using Gaussian splatting, as well as editing and animation by modifying the base mesh. We demonstrate the effectiveness of our method on various synthetic and real scenes, and show that it outperforms existing surface-based approaches. We will release our code and a web-based viewer as additional contributions. Our project page is the following: https://anttwo.github.io/frosting/
comment: Project Webpage: https://anttwo.github.io/frosting/
☆ Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer CVPR 2024
While Transformers have rapidly gained popularity in various computer vision applications, post-hoc explanations of their internal mechanisms remain largely unexplored. Vision Transformers extract visual information by representing image regions as transformed tokens and integrating them via attention weights. However, existing post-hoc explanation methods merely consider these attention weights, neglecting crucial information from the transformed tokens, which fails to accurately illustrate the rationales behind the models' predictions. To incorporate the influence of token transformation into interpretation, we propose TokenTM, a novel post-hoc explanation method that utilizes our introduced measurement of token transformation effects. Specifically, we quantify token transformation effects by measuring changes in token lengths and correlations in their directions pre- and post-transformation. Moreover, we develop initialization and aggregation rules to integrate both attention weights and token transformation effects across all layers, capturing holistic token contributions throughout the model. Experimental results on segmentation and perturbation tests demonstrate the superiority of our proposed TokenTM compared to state-of-the-art Vision Transformer explanation methods.
comment: CVPR 2024
☆ DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video
We present DINO-Tracker -- a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video, with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adopts DINO's features to fit to the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses, and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.
☆ Estimating Physical Information Consistency of Channel Data Augmentation for Remote Sensing Images
The application of data augmentation for deep learning (DL) methods plays an important role in achieving state-of-the-art results in supervised, semi-supervised, and self-supervised image classification. In particular, channel transformations (e.g., solarize, grayscale, brightness adjustments) are integrated into data augmentation pipelines for remote sensing (RS) image classification tasks. However, contradicting beliefs exist about their proper applications to RS images. A common point of critique is that the application of channel augmentation techniques may lead to physically inconsistent spectral data (i.e., pixel signatures). To shed light on the open debate, we propose an approach to estimate whether a channel augmentation technique affects the physical information of RS images. To this end, the proposed approach estimates a score that measures the alignment of a pixel signature within a time series that can be naturally subject to deviations caused by factors such as acquisition conditions or phenological states of vegetation. We compare the scores associated with original and augmented pixel signatures to evaluate the physical consistency. Experimental results on a multi-label image classification task show that channel augmentations yielding a score that exceeds the expected deviation of original pixel signatures can not improve the performance of a baseline model trained without augmentation.
comment: Accepted at the IEEE International Geoscience and Remote Sensing Symposium
☆ Object-Centric Domain Randomization for 3D Shape Reconstruction in the Wild
One of the biggest challenges in single-view 3D shape reconstruction in the wild is the scarcity of <3D shape, 2D image>-paired data from real-world environments. Inspired by remarkable achievements via domain randomization, we propose ObjectDR which synthesizes such paired data via a random simulation of visual variations in object appearances and backgrounds. Our data synthesis framework exploits a conditional generative model (e.g., ControlNet) to generate images conforming to spatial conditions such as 2.5D sketches, which are obtainable through a rendering process of 3D shapes from object collections (e.g., Objaverse-XL). To simulate diverse variations while preserving object silhouettes embedded in spatial conditions, we also introduce a disentangled framework which leverages an initial object guidance. After synthesizing a wide range of data, we pre-train a model on them so that it learns to capture a domain-invariant geometry prior which is consistent across various domains. We validate its effectiveness by substantially improving 3D shape reconstruction models on a real-world benchmark. In a scale-up evaluation, our pre-training achieves 23.6% superior results compared with the pre-training on high-quality computer graphics renderings.
comment: Project Page: https://ObjectDR.github.io
☆ Transfer Learning for Cross-dataset Isolated Sign Language Recognition in Under-Resourced Datasets
Sign language recognition (SLR) has recently achieved a breakthrough in performance thanks to deep neural networks trained on large annotated sign datasets. Of the many different sign languages, these annotated datasets are only available for a select few. Since acquiring gloss-level labels on sign language videos is difficult, learning by transferring knowledge from existing annotated sources is useful for recognition in under-resourced sign languages. This study provides a publicly available cross-dataset transfer learning benchmark from two existing public Turkish SLR datasets. We use a temporal graph convolution-based sign language recognition approach to evaluate five supervised transfer learning approaches and experiment with closed-set and partial-set cross-dataset transfer learning. Experiments demonstrate that improvement over finetuning based transfer learning is possible with specialized supervised transfer learning methods.
comment: Accepted to The 18th IEEE International Conference on Automatic Face and Gesture Recognition 2024, Code available in https://github.com/alpk/tid-supervised-transfer-learning-dataset
☆ HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression
3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To address this, we make use of the relations between the unorganized anchors and the structured hash grid, leveraging their mutual information for context modeling, and propose a Hash-grid Assisted Context (HAC) framework for highly compact 3DGS representation. Our approach introduces a binary hash grid to establish continuous spatial consistencies, allowing us to unveil the inherent spatial relations of anchors through a carefully designed context model. To facilitate entropy coding, we utilize Gaussian distributions to accurately estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Additionally, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Importantly, our work is the pioneer to explore context-based compression for 3DGS representation, resulting in a remarkable size reduction of over $75\times$ compared to vanilla 3DGS, while simultaneously improving fidelity, and achieving over $11\times$ size reduction over SOTA 3DGS compression approach Scaffold-GS. Our code is available here: https://github.com/YihangChen-ee/HAC
comment: Project Page: https://yihangchen-ee.github.io/project_hac/ Code: https://github.com/YihangChen-ee/HAC
☆ Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors
Precise manipulation that is generalizable across scenes and objects remains a persistent challenge in robotics. Current approaches for this task heavily depend on having a significant number of training instances to handle objects with pronounced visual and/or geometric part ambiguities. Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting by utilizing web-trained text-to-image diffusion-based generative models. We tackle the problem by framing it as a dense semantic part correspondence task. Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click from a source image of a visually different instance of the same object. We require no manual grasping demonstrations as we leverage the intrinsic object geometry and features. Practical experiments in a real-world tabletop scenario validate the efficacy of our approach, demonstrating its potential for advancing semantic-aware robotics manipulation. Web page: https://tsagkas.github.io/click2grasp
comment: 8 pages, 4 figures
☆ Invisible Needle Detection in Ultrasound: Leveraging Mechanism-Induced Vibration
In clinical applications that involve ultrasound-guided intervention, the visibility of the needle can be severely impeded due to steep insertion and strong distractors such as speckle noise and anatomical occlusion. To address this challenge, we propose VibNet, a learning-based framework tailored to enhance the robustness and accuracy of needle detection in ultrasound images, even when the target becomes invisible to the naked eye. Inspired by Eulerian Video Magnification techniques, we utilize an external step motor to induce low-amplitude periodic motion on the needle. These subtle vibrations offer the potential to generate robust frequency features for detecting the motion patterns around the needle. To robustly and precisely detect the needle leveraging these vibrations, VibNet integrates learning-based Short-Time-Fourier-Transform and Hough-Transform modules to achieve successive sub-goals, including motion feature extraction in the spatiotemporal space, frequency feature aggregation, and needle detection in the Hough space. Based on the results obtained on distinct ex vivo porcine and bovine tissue samples, the proposed algorithm exhibits superior detection performance with efficient computation and generalization capability.
☆ Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
In recent years, the application of multimodal large language models (MLLM) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are composed of the well-known Transformer network, which has a less efficient quadratic computation complexity. To improve the efficiency of such basic models, we propose Cobra, a linear computational complexity MLLM. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, \textit{e.g.}, LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster speed due to Cobra's linear sequential modeling. (2) Interestingly, the results of closed-set challenging prediction benchmarks show that Cobra performs well in overcoming visual illusions and spatial relationship judgments. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the number of parameters. We will make all codes of Cobra open-source and hope that the proposed method can facilitate future research on complexity problems in MLLM. Our project page is available at: https://sites.google.com/view/cobravlm.
☆ View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network CVPR 2024
Existing person re-identification methods have achieved remarkable advances in appearance-based identity association across homogeneous cameras, such as ground-ground matching. However, as a more practical scenario, aerial-ground person re-identification (AGPReID) among heterogeneous cameras has received minimal attention. To alleviate the disruption of discriminative identity representation by dramatic view discrepancy as the most significant challenge in AGPReID, the view-decoupled transformer (VDT) is proposed as a simple yet effective framework. Two major components are designed in VDT to decouple view-related and view-unrelated features, namely hierarchical subtractive separation and orthogonal loss, where the former separates these two features inside the VDT, and the latter constrains these two to be independent. In addition, we contribute a large-scale AGPReID dataset called CARGO, consisting of five/eight aerial/ground cameras, 5,000 identities, and 108,563 images. Experiments on two datasets show that VDT is a feasible and effective solution for AGPReID, surpassing the previous method on mAP/Rank1 by up to 5.0%/2.7% on CARGO and 3.7%/5.2% on AG-ReID, keeping the same magnitude of computational complexity. Our project is available at https://github.com/LinlyAC/VDT-AGPReID
comment: CVPR 2024
☆ Denoising Diffusion Models for 3D Healthy Brain Tissue Inpainting
Monitoring diseases that affect the brain's structural integrity requires automated analysis of magnetic resonance (MR) images, e.g., for the evaluation of volumetric changes. However, many of the evaluation tools are optimized for analyzing healthy tissue. To enable the evaluation of scans containing pathological tissue, it is therefore required to restore healthy tissue in the pathological areas. In this work, we explore and extend denoising diffusion models for consistent inpainting of healthy 3D brain tissue. We modify state-of-the-art 2D, pseudo-3D, and 3D methods working in the image space, as well as 3D latent and 3D wavelet diffusion models, and train them to synthesize healthy brain tissue. Our evaluation shows that the pseudo-3D model performs best regarding the structural-similarity index, peak signal-to-noise ratio, and mean squared error. To emphasize the clinical relevance, we fine-tune this model on data containing synthetic MS lesions and evaluate it on a downstream brain tissue segmentation task, whereby it outperforms the established FMRIB Software Library (FSL) lesion-filling method.
☆ MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection
We propose a novel approach to video anomaly detection: we treat feature vectors extracted from videos as realizations of a random variable with a fixed distribution and model this distribution with a neural network. This lets us estimate the likelihood of test videos and detect video anomalies by thresholding the likelihood estimates. We train our video anomaly detector using a modification of denoising score matching, a method that injects training data with noise to facilitate modeling its distribution. To eliminate hyperparameter selection, we model the distribution of noisy video features across a range of noise levels and introduce a regularizer that tends to align the models for different levels of noise. At test time, we combine anomaly indications at multiple noise scales with a Gaussian mixture model. Running our video anomaly detector induces minimal delays as inference requires merely extracting the features and forward-propagating them through a shallow neural network and a Gaussian mixture model. Our experiments on five popular video anomaly detection benchmarks demonstrate state-of-the-art performance, both in the object-centric and in the frame-centric setup.
☆ Learning to Project for Cross-Task Knowledge Distillation
Traditional knowledge distillation (KD) relies on a proficient teacher trained on the target task, which is not always available. In this setting, cross-task distillation can be used, enabling the use of any teacher model trained on a different task. However, many KD methods prove ineffective when applied to this cross-task setting. To address this limitation, we propose a simple modification: the use of an inverted projection. We show that this drop-in replacement for a standard projector is effective by learning to disregard any task-specific features which might degrade the student's performance. We find that this simple modification is sufficient for extending many KD methods to the cross-task setting, where the teacher and student tasks can be very different. In doing so, we obtain up to a 1.9% improvement in the cross-task setting compared to the traditional projection, at no additional cost. Our method can obtain significant performance improvements (up to 7%) when using even a randomly-initialised teacher on various tasks such as depth estimation, image translation, and semantic segmentation, despite the lack of any learned knowledge to transfer. To provide conceptual and analytical insights into this result, we show that using an inverted projection allows the distillation loss to be decomposed into a knowledge transfer and a spectral regularisation component. Through this analysis we are additionally able to propose a novel regularisation loss that allows teacher-free distillation, enabling performance improvements of up to 8.57% on ImageNet with no additional training costs.
☆ Adversary-Robust Graph-Based Learning of WSIs
Enhancing the robustness of deep learning models against adversarial attacks is crucial, especially in critical domains like healthcare where significant financial interests heighten the risk of such attacks. Whole slide images (WSIs) are high-resolution, digitized versions of tissue samples mounted on glass slides, scanned using sophisticated imaging equipment. The digital analysis of WSIs presents unique challenges due to their gigapixel size and multi-resolution storage format. In this work, we aim at improving the robustness of cancer Gleason grading classification systems against adversarial attacks, addressing challenges at both the image and graph levels. As regards the proposed algorithm, we develop a novel and innovative graph-based model which utilizes GNN to extract features from the graph representation of WSIs. A denoising module, along with a pooling layer is incorporated to manage the impact of adversarial attacks on the WSIs. The process concludes with a transformer module that classifies various grades of prostate cancer based on the processed data. To assess the effectiveness of the proposed method, we conducted a comparative analysis using two scenarios. Initially, we trained and tested the model without the denoiser using WSIs that had not been exposed to any attack. We then introduced a range of attacks at either the image or graph level and processed them through the proposed network. The performance of the model was evaluated in terms of accuracy and kappa scores. The results from this comparison showed a significant improvement in cancer diagnosis accuracy, highlighting the robustness and efficiency of the proposed method in handling adversarial challenges in the context of medical imaging.
☆ DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing
Recently, how to achieve precise image editing has attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities into one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, which include several object layers and one incomplete background layer that necessitates reliable inpainting. To avoid extra tuning, we further explore the inner inpainting ability within the self-attention mechanism. We introduce a key-masking self-attention scheme that can propagate the surrounding context information into the masked region while mitigating its impact on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent. We also introduce an artifact suppression scheme in the latent space to enhance the inpainting quality. Due to the inherent modular advantages of such multi-layered representations, we can achieve accurate image editing, and we demonstrate that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Last, we show that our approach is a unified framework that supports various accurate image editing tasks on more than six different editing tasks.
comment: technical report, 15 pages, webpage: https://design-edit.github.io/
☆ HyperGALE: ASD Classification via Hypergraph Gated Attention with Learnable Hyperedges IJCNN 2024
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by varied social cognitive challenges and repetitive behavioral patterns. Identifying reliable brain imaging-based biomarkers for ASD has been a persistent challenge due to the spectrum's diverse symptomatology. Existing baselines in the field have made significant strides in this direction, yet there remains room for improvement in both performance and interpretability. We propose \emph{HyperGALE}, which builds upon the hypergraph by incorporating learned hyperedges and gated attention mechanisms. This approach has led to substantial improvements in the model's ability to interpret complex brain graph data, offering deeper insights into ASD biomarker characterization. Evaluated on the extensive ABIDE II dataset, \emph{HyperGALE} not only improves interpretability but also demonstrates statistically significant enhancements in key performance metrics compared to both previous baselines and the foundational hypergraph model. The advancement \emph{HyperGALE} brings to ASD research highlights the potential of sophisticated graph-based techniques in neurodevelopmental studies. The source code and implementation instructions are available at GitHub:https://github.com/mehular0ra/HyperGALE.
comment: Accepted to IJCNN 2024
☆ Detoxifying Large Language Models via Knowledge Editing
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and equips comprehensive metrics for systematic evaluation. We conduct experiments to compare knowledge editing approaches with previous baselines, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance. Then, we propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), to diminish the toxicity of LLMs within a few tuning steps via only one instance. We further provide an in-depth analysis of the internal mechanism for various detoxify approaches, demonstrating that previous methods like SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of the toxic parameters to a certain extent, making permanent adjustments. We hope that these insights could shed light on future work of developing detoxifying approaches and the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.
comment: Ongoing work. Project website: https://zjunlp.github.io/project/SafeEdit Benchmark: https://huggingface.co/datasets/zjunlp/SafeEdit Code: https://github.com/zjunlp/EasyEdit
☆ AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks
Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g. InstructPix2Pix, InstantID, etc) to modify the first frame, (2) utilizing an existing image-to-video generation model (e.g. I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tools to support an extensive array of video editing tasks. Beyond the traditional prompt-based editing methods, AnyV2V also can support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video models to perform DDIM inversion and intermediate feature injection to maintain the appearance and motion consistency with the source video. On the prompt-based editing, we show that AnyV2V can outperform the previous best approach by 35\% on prompt alignment, and 25\% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate the fast-evolving image editing methods. Such compatibility can help AnyV2V to increase its versatility to cater to diverse user demands.
comment: preprint
☆ CathFlow: Self-Supervised Segmentation of Catheters in Interventional Ultrasound Using Optical Flow and Transformers
In minimally invasive endovascular procedures, contrast-enhanced angiography remains the most robust imaging technique. However, it is at the expense of the patient and clinician's health due to prolonged radiation exposure. As an alternative, interventional ultrasound has notable benefits such as being radiation-free, fast to deploy, and having a small footprint in the operating room. Yet, ultrasound is hard to interpret, and highly prone to artifacts and noise. Additionally, interventional radiologists must undergo extensive training before they become qualified to diagnose and treat patients effectively, leading to a shortage of staff, and a lack of open-source datasets. In this work, we seek to address both problems by introducing a self-supervised deep learning architecture to segment catheters in longitudinal ultrasound images, without demanding any labeled data. The network architecture builds upon AiAReSeg, a segmentation transformer built with the Attention in Attention mechanism, and is capable of learning feature changes across time and space. To facilitate training, we used synthetic ultrasound data based on physics-driven catheter insertion simulations, and translated the data into a unique CT-Ultrasound common domain, CACTUSS, to improve the segmentation performance. We generated ground truth segmentation masks by computing the optical flow between adjacent frames using FlowNet2, and performed thresholding to obtain a binary map estimate. Finally, we validated our model on a test dataset, consisting of unseen synthetic data and images collected from silicon aorta phantoms, thus demonstrating its potential for applications to clinical data in the future.
comment: This work has been submitted to the IEEE for possible publication
☆ Exploring 3D Human Pose Estimation and Forecasting from the Robot's Perspective: The HARPER Dataset
We introduce HARPER, a novel dataset for 3D body pose estimation and forecast in dyadic interactions between users and \spot, the quadruped robot manufactured by Boston Dynamics. The key-novelty is the focus on the robot's perspective, i.e., on the data captured by the robot's sensors. These make 3D body pose analysis challenging because being close to the ground captures humans only partially. The scenario underlying HARPER includes 15 actions, of which 10 involve physical contact between the robot and users. The Corpus contains not only the recordings of the built-in stereo cameras of Spot, but also those of a 6-camera OptiTrack system (all recordings are synchronized). This leads to ground-truth skeletal representations with a precision lower than a millimeter. In addition, the Corpus includes reproducible benchmarks on 3D Human Pose Estimation, Human Pose Forecasting, and Collision Prediction, all based on publicly available baseline approaches. This enables future HARPER users to rigorously compare their results with those we provide in this work.
☆ RoDLA: Benchmarking the Robustness of Document Layout Analysis Models CVPR 2024
Before developing a Document Layout Analysis (DLA) model in real-world applications, conducting comprehensive robustness testing is essential. However, the robustness of DLA models remains underexplored in the literature. To address this, we are the first to introduce a robustness benchmark for DLA models, which includes 450K document images of three datasets. To cover realistic corruptions, we propose a perturbation taxonomy with 36 common document perturbations inspired by real-world document processing. Additionally, to better understand document perturbation impacts, we propose two metrics, Mean Perturbation Effect (mPE) for perturbation assessment and Mean Robustness Degradation (mRD) for robustness evaluation. Furthermore, we introduce a self-titled model, i.e., Robust Document Layout Analyzer (RoDLA), which improves attention mechanisms to boost extraction of robust features. Experiments on the proposed benchmarks (PubLayNet-P, DocLayNet-P, and M$^6$Doc-P) demonstrate that RoDLA obtains state-of-the-art mRD scores of 115.7, 135.4, and 150.4, respectively. Compared to previous methods, RoDLA achieves notable improvements in mAP of +3.8%, +7.1% and +12.1%, respectively.
comment: Accepted by CVPR 2024. Project page: https://yufanchen96.github.io/projects/RoDLA
☆ Analysing Diffusion Segmentation for Medical Images
Denoising Diffusion Probabilistic models have become increasingly popular due to their ability to offer probabilistic modeling and generate diverse outputs. This versatility inspired their adaptation for image segmentation, where multiple predictions of the model can produce segmentation results that not only achieve high quality but also capture the uncertainty inherent in the model. Here, powerful architectures were proposed for improving diffusion segmentation performance. However, there is a notable lack of analysis and discussions on the differences between diffusion segmentation and image generation, and thorough evaluations are missing that distinguish the improvements these architectures provide for segmentation in general from their benefit for diffusion segmentation specifically. In this work, we critically analyse and discuss how diffusion segmentation for medical images differs from diffusion image generation, with a particular focus on the training behavior. Furthermore, we conduct an assessment how proposed diffusion segmentation architectures perform when trained directly for segmentation. Lastly, we explore how different medical segmentation tasks influence the diffusion segmentation behavior and the diffusion process could be adapted accordingly. With these analyses, we aim to provide in-depth insights into the behavior of diffusion segmentation that allow for a better design and evaluation of diffusion segmentation methods in the future.
☆ Raw Instinct: Trust Your Classifiers and Skip the Conversion
Using RAW-images in computer vision problems is surprisingly underexplored considering that converting from RAW to RGB does not introduce any new capture information. In this paper, we show that a sufficiently advanced classifier can yield equivalent results on RAW input compared to RGB and present a new public dataset consisting of RAW images and the corresponding converted RGB images. Classifying images directly from RAW is attractive, as it allows for skipping the conversion to RGB, lowering computation time significantly. Two CNN classifiers are used to classify the images in both formats, confirming that classification performance can indeed be preserved. We furthermore show that the total computation time from RAW image data to classification results for RAW images can be up to 8.46 times faster than RGB. These results contribute to the evidence found in related works, that using RAW images as direct input to computer vision algorithms looks very promising.
comment: https://www.kaggle.com/datasets/mathiasviborg/raw-instinct
☆ Biased Binary Attribute Classifiers Ignore the Majority Classes
To visualize the regions of interest that classifiers base their decisions on, different Class Activation Mapping (CAM) methods have been developed. However, all of these techniques target categorical classifiers only, though most real-world tasks are binary classification. In this paper, we extend gradient-based CAM techniques to work with binary classifiers and visualize the active regions for binary facial attribute classifiers. When training an unbalanced binary classifier on an imbalanced dataset, it is well-known that the majority class, i.e. the class with many training samples, is mostly predicted much better than minority class with few training instances. In our experiments on the CelebA dataset, we verify these results, when training an unbalanced classifier to extract 40 facial attributes simultaneously. One would expect that the biased classifier has learned to extract features mainly for the majority classes and that the proportional energy of the activations mainly reside in certain specific regions of the image where the attribute is located. However, we find very little regular activation for samples of majority classes, while the active regions for minority classes seem mostly reasonable and overlap with our expectations. These results suggest that biased classifiers mainly rely on bias activation for majority classes. When training a balanced classifier on the imbalanced data by employing attribute-specific class weights, majority and minority classes are classified similarly well and show expected activations for almost all attributes
☆ Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels CVPR 2024
This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question. This is essentially a multi-label classification task, since a question may have multiple answers. However, due to annotation costs, the labels in existing benchmarks are always extremely insufficient, typically one answer per question. As a result, existing works tend to directly treat all the unlabeled answers as negative labels, leading to limited ability for generalization. In this work, we introduce a simple yet effective ranking distillation framework (RADI) to mitigate this problem without additional manual annotation. RADI employs a teacher model trained with incomplete labels to generate rankings for potential answers, which contain rich knowledge about label priority as well as label-associated visual cues, thereby enriching the insufficient labeling information. To avoid overconfidence in the imperfect teacher model, we further present two robust and parameter-free ranking distillation approaches: a pairwise approach which introduces adaptive soft margins to dynamically refine the optimization constraints on various pairwise rankings, and a listwise approach which adopts sampling-based partial listwise learning to resist the bias in teacher ranking. Extensive experiments on five popular benchmarks consistently show that both our pairwise and listwise RADIs outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient labeling problem.
comment: Accepted to CVPR 2024
☆ Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation
Deep learning-based image generation has seen significant advancements with diffusion models, notably improving the quality of generated images. Despite these developments, generating images with unseen characteristics beneficial for downstream tasks has received limited attention. To bridge this gap, we propose Style-Extracting Diffusion Models, featuring two conditioning mechanisms. Specifically, we utilize 1) a style conditioning mechanism which allows to inject style information of previously unseen images during image generation and 2) a content conditioning which can be targeted to a downstream task, e.g., layout for segmentation. We introduce a trainable style encoder to extract style information from images, and an aggregation block that merges style information from multiple style inputs. This architecture enables the generation of images with unseen styles in a zero-shot manner, by leveraging styles from unseen images, resulting in more diverse generations. In this work, we use the image layout as target condition and first show the capability of our method on a natural image dataset as a proof-of-concept. We further demonstrate its versatility in histopathology, where we combine prior knowledge about tissue composition and unannotated data to create diverse synthetic images with known layouts. This allows us to generate additional synthetic data to train a segmentation network in a semi-supervised fashion. We verify the added value of the generated images by showing improved segmentation results and lower performance variability between patients when synthetic images are included during segmentation training. Our code will be made publicly available at [LINK].
☆ DP-RDM: Adapting Diffusion Models to Private Domains Without Fine-Tuning
Text-to-image diffusion models have been shown to suffer from sample-level memorization, possibly reproducing near-perfect replica of images that they are trained on, which may be undesirable. To remedy this issue, we develop the first differentially private (DP) retrieval-augmented generation algorithm that is capable of generating high-quality image samples while providing provable privacy guarantees. Specifically, we assume access to a text-to-image diffusion model trained on a small amount of public data, and design a DP retrieval mechanism to augment the text prompt with samples retrieved from a private retrieval dataset. Our \emph{differentially private retrieval-augmented diffusion model} (DP-RDM) requires no fine-tuning on the retrieval dataset to adapt to another domain, and can use state-of-the-art generative models to generate high-quality image samples while satisfying rigorous DP guarantees. For instance, when evaluated on MS-COCO, our DP-RDM can generate samples with a privacy budget of $\epsilon=10$, while providing a $3.5$ point improvement in FID compared to public-only retrieval for up to $10,000$ queries.
☆ OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation CVPR 2024
The booming of 3D recognition in the 2020s began with the introduction of point cloud transformers. They quickly overwhelmed sparse CNNs and became state-of-the-art models, especially in 3D semantic segmentation. However, sparse CNNs are still valuable networks, due to their efficiency treasure, and ease of application. In this work, we reexamine the design distinctions and test the limits of what a sparse CNN can achieve. We discover that the key credit to the performance difference is adaptivity. Specifically, we propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap. This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module to greatly enhance the adaptivity of sparse CNNs at minimal computational cost. Without any self-attention modules, OA-CNNs favorably surpass point transformers in terms of accuracy in both indoor and outdoor scenes, with much less latency and memory cost. Notably, it achieves 76.1%, 78.9%, and 70.6% mIoU on ScanNet v2, nuScenes, and SemanticKITTI validation benchmarks respectively, while maintaining at most 5x better speed than transformer counterparts. This revelation highlights the potential of pure sparse CNNs to outperform transformer-related networks.
comment: CVPR 2024
☆ CombiNeRF: A Combination of Regularization Techniques for Few-Shot Neural Radiance Field View Synthesis 3DV
Neural Radiance Fields (NeRFs) have shown impressive results for novel view synthesis when a sufficiently large amount of views are available. When dealing with few-shot settings, i.e. with a small set of input views, the training could overfit those views, leading to artifacts and geometric and chromatic inconsistencies in the resulting rendering. Regularization is a valid solution that helps NeRF generalization. On the other hand, each of the most recent NeRF regularization techniques aim to mitigate a specific rendering problem. Starting from this observation, in this paper we propose CombiNeRF, a framework that synergically combines several regularization techniques, some of them novel, in order to unify the benefits of each. In particular, we regularize single and neighboring rays distributions and we add a smoothness term to regularize near geometries. After these geometric approaches, we propose to exploit Lipschitz regularization to both NeRF density and color networks and to use encoding masks for input features regularization. We show that CombiNeRF outperforms the state-of-the-art methods with few-shot settings in several publicly available datasets. We also present an ablation study on the LLFF and NeRF-Synthetic datasets that support the choices made. We release with this paper the open-source implementation of our framework.
comment: This paper has been accepted for publication at the 2024 International Conference on 3D Vision (3DV)
☆ GLC++: Source-Free Universal Domain Adaptation through Global-Local Clustering and Contrastive Affinity Learning CVPR 2023
Deep neural networks often exhibit sub-optimal performance under covariate and category shifts. Source-Free Domain Adaptation (SFDA) presents a promising solution to this dilemma, yet most SFDA approaches are restricted to closed-set scenarios. In this paper, we explore Source-Free Universal Domain Adaptation (SF-UniDA) aiming to accurately classify "known" data belonging to common categories and segregate them from target-private "unknown" data. We propose a novel Global and Local Clustering (GLC) technique, which comprises an adaptive one-vs-all global clustering algorithm to discern between target classes, complemented by a local k-NN clustering strategy to mitigate negative transfer. Despite the effectiveness, the inherent closed-set source architecture leads to uniform treatment of "unknown" data, impeding the identification of distinct "unknown" categories. To address this, we evolve GLC to GLC++, integrating a contrastive affinity learning strategy. We examine the superiority of GLC and GLC++ across multiple benchmarks and category shift scenarios. Remarkably, in the most challenging open-partial-set scenarios, GLC and GLC++ surpass GATE by 16.7% and 18.6% in H-score on VisDA, respectively. GLC++ enhances the novel category clustering accuracy of GLC by 4.3% in open-set scenarios on Office-Home. Furthermore, the introduced contrastive learning strategy not only enhances GLC but also significantly facilitates existing methodologies.
comment: This is a substantial extension of the CVPR 2023 paper "Upcycling Models under Domain and Category Shift"
☆ Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination
Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs completely oblivious to accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may simultaneously advocate both accurate and non-existent content. To address this issue, we propose Pensieve, a training-free method inspired by our observation that analogous visual hallucinations can arise among images sharing common semantic and appearance characteristics. During inference, Pensieve enables MLLMs to retrospect relevant images as references and compare them with the test image. This paradigm assists MLLMs in downgrading hallucinatory content mistakenly supported by the visual input. Experiments on Whoops, MME, POPE, and LLaVA Bench demonstrate the efficacy of Pensieve in mitigating visual hallucination, surpassing other advanced decoding strategies. Additionally, Pensieve aids MLLMs in identifying details in the image and enhancing the specificity of image descriptions.
☆ A Bag of Tricks for Few-Shot Class-Incremental Learning
We present a bag of tricks framework for few-shot class-incremental learning (FSCIL), which is a challenging form of continual learning that involves continuous adaptation to new tasks with limited samples. FSCIL requires both stability and adaptability, i.e., preserving proficiency in previously learned tasks while learning new ones. Our proposed bag of tricks brings together eight key and highly influential techniques that improve stability, adaptability, and overall performance under a unified framework for FSCIL. We organize these tricks into three categories: stability tricks, adaptability tricks, and training tricks. Stability tricks aim to mitigate the forgetting of previously learned classes by enhancing the separation between the embeddings of learned classes and minimizing interference when learning new ones. On the other hand, adaptability tricks focus on the effective learning of new classes. Finally, training tricks improve the overall performance without compromising stability or adaptability. We perform extensive experiments on three benchmark datasets, CIFAR-100, CUB-200, and miniIMageNet, to evaluate the impact of our proposed framework. Our detailed analysis shows that our approach substantially improves both stability and adaptability, establishing a new state-of-the-art by outperforming prior works in the area. We believe our method provides a go-to solution and establishes a robust baseline for future research in this area.
☆ Tensor network compressibility of convolutional models
Convolutional neural networks (CNNs) represent one of the most widely used neural network architectures, showcasing state-of-the-art performance in computer vision tasks. Although larger CNNs generally exhibit higher accuracy, their size can be effectively reduced by "tensorization" while maintaining accuracy. Tensorization consists of replacing the convolution kernels with compact decompositions such as Tucker, Canonical Polyadic decompositions, or quantum-inspired decompositions such as matrix product states, and directly training the factors in the decompositions to bias the learning towards low-rank decompositions. But why doesn't tensorization seem to impact the accuracy adversely? We explore this by assessing how truncating the convolution kernels of dense (untensorized) CNNs impact their accuracy. Specifically, we truncated the kernels of (i) a vanilla four-layer CNN and (ii) ResNet-50 pre-trained for image classification on CIFAR-10 and CIFAR-100 datasets. We found that kernels (especially those inside deeper layers) could often be truncated along several cuts resulting in significant loss in kernel norm but not in classification accuracy. This suggests that such ``correlation compression'' (underlying tensorization) is an intrinsic feature of how information is encoded in dense CNNs. We also found that aggressively truncated models could often recover the pre-truncation accuracy after only a few epochs of re-training, suggesting that compressing the internal correlations of convolution layers does not often transport the model to a worse minimum. Our results can be applied to tensorize and compress CNN models more effectively.
comment: 20 pages, 21 images
☆ InfNeRF: Towards Infinite Scale NeRF Rendering with O(log n) Space Complexity
The conventional mesh-based Level of Detail (LoD) technique, exemplified by applications such as Google Earth and many game engines, exhibits the capability to holistically represent a large scene even the Earth, and achieves rendering with a space complexity of O(log n). This constrained data requirement not only enhances rendering efficiency but also facilitates dynamic data fetching, thereby enabling a seamless 3D navigation experience for users. In this work, we extend this proven LoD technique to Neural Radiance Fields (NeRF) by introducing an octree structure to represent the scenes in different scales. This innovative approach provides a mathematically simple and elegant representation with a rendering space complexity of O(log n), aligned with the efficiency of mesh-based LoD techniques. We also present a novel training strategy that maintains a complexity of O(n). This strategy allows for parallel training with minimal overhead, ensuring the scalability and efficiency of our proposed method. Our contribution is not only in extending the capabilities of existing techniques but also in establishing a foundation for scalable and efficient large-scale scene representation using NeRF and octree structures.
☆ SyncTweedies: A General Generative Framework Based on Synchronized Diffusions
We introduce a general framework for generating diverse visual content, including ambiguous images, panorama images, mesh textures, and Gaussian splat textures, by synchronizing multiple diffusion processes. We present exhaustive investigation into all possible scenarios for synchronizing multiple diffusion processes through a canonical space and analyze their characteristics across applications. In doing so, we reveal a previously unexplored case: averaging the outputs of Tweedie's formula while conducting denoising in multiple instance spaces. This case also provides the best quality with the widest applicability to downstream tasks. We name this case SyncTweedies. In our experiments generating visual content aforementioned, we demonstrate the superior quality of generation by SyncTweedies compared to other synchronization methods, optimization-based and iterative-update-based methods.
comment: Project page: https://synctweedies.github.io/
☆ Enabling Visual Composition and Animation in Unsupervised Video Generation
In this work we propose a novel method for unsupervised controllable video generation. Once trained on a dataset of unannotated videos, at inference our model is capable of both composing scenes of predefined object parts and animating them in a plausible and controlled way. This is achieved by conditioning video generation on a randomly selected subset of local pre-trained self-supervised features during training. We call our model CAGE for visual Composition and Animation for video GEneration. We conduct a series of experiments to demonstrate capabilities of CAGE in various settings. Project website: https://araachie.github.io/cage.
comment: Project website: https://araachie.github.io/cage
☆ SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field
Vision-centric 3D environment understanding is both vital and challenging for autonomous driving systems. Recently, object-free methods have attracted considerable attention. Such methods perceive the world by predicting the semantics of discrete voxel grids but fail to construct continuous and accurate obstacle surfaces. To this end, in this paper, we propose SurroundSDF to implicitly predict the signed distance field (SDF) and semantic field for the continuous perception from surround images. Specifically, we introduce a query-based approach and utilize SDF constrained by the Eikonal formulation to accurately describe the surfaces of obstacles. Furthermore, considering the absence of precise SDF ground truth, we propose a novel weakly supervised paradigm for SDF, referred to as the Sandwich Eikonal formulation, which emphasizes applying correct and dense constraints on both sides of the surface, thereby enhancing the perceptual accuracy of the surface. Experiments suggest that our method achieves SOTA for both occupancy prediction and 3D scene reconstruction tasks on the nuScenes dataset.
☆ Less but Better: Enabling Generalized Zero-shot Learning Towards Unseen Domains by Intrinsic Learning from Redundant LLM Semantics
Generalized zero-shot learning (GZSL) focuses on recognizing seen and unseen classes against domain shift problem (DSP) where data of unseen classes may be misclassified as seen classes. However, existing GZSL is still limited to seen domains. In the current work, we pioneer cross-domain GZSL (CDGZSL) which addresses GZSL towards unseen domains. Different from existing GZSL methods which alleviate DSP by generating features of unseen classes with semantics, CDGZSL needs to construct a common feature space across domains and acquire the corresponding intrinsic semantics shared among domains to transfer from seen to unseen domains. Considering the information asymmetry problem caused by redundant class semantics annotated with large language models (LLMs), we present Meta Domain Alignment Semantic Refinement (MDASR). Technically, MDASR consists of two parts: Inter-class Similarity Alignment (ISA), which eliminates the non-intrinsic semantics not shared across all domains under the guidance of inter-class feature relationships, and Unseen-class Meta Generation (UMG), which preserves intrinsic semantics to maintain connectivity between seen and unseen classes by simulating feature generation. MDASR effectively aligns the redundant semantic space with the common feature space, mitigating the information asymmetry in CDGZSL. The effectiveness of MDASR is demonstrated on the Office-Home and Mini-DomainNet, and we have shared the LLM-based semantics for these datasets as the benchmark.
comment: This work is submitted to IEEE TNNLS and is subject to IEEE copyright
☆ Varroa destructor detection on honey bees using hyperspectral imagery
Hyperspectral (HS) imagery in agriculture is becoming increasingly common. These images have the advantage of higher spectral resolution. Advanced spectral processing techniques are required to unlock the information potential in these HS images. The present paper introduces a method rooted in multivariate statistics designed to detect parasitic Varroa destructor mites on the body of western honey bee Apis mellifera, enabling easier and continuous monitoring of the bee hives. The methodology explores unsupervised (K-means++) and recently developed supervised (Kernel Flows - Partial Least-Squares, KF-PLS) methods for parasitic identification. Additionally, in light of the emergence of custom-band multispectral cameras, the present research outlines a strategy for identifying the specific wavelengths necessary for effective bee-mite separation, suitable for implementation in a custom-band camera. Illustrated with a real-case dataset, our findings demonstrate that as few as four spectral bands are sufficient for accurate parasite identification.
☆ LDTR: Transformer-based Lane Detection with Anchor-chain Representation
Despite recent advances in lane detection methods, scenarios with limited- or no-visual-clue of lanes due to factors such as lighting conditions and occlusion remain challenging and crucial for automated driving. Moreover, current lane representations require complex post-processing and struggle with specific instances. Inspired by the DETR architecture, we propose LDTR, a transformer-based model to address these issues. Lanes are modeled with a novel anchor-chain, regarding a lane as a whole from the beginning, which enables LDTR to handle special lanes inherently. To enhance lane instance perception, LDTR incorporates a novel multi-referenced deformable attention module to distribute attention around the object. Additionally, LDTR incorporates two line IoU algorithms to improve convergence efficiency and employs a Gaussian heatmap auxiliary branch to enhance model representation capability during training. To evaluate lane detection models, we rely on Frechet distance, parameterized F1-score, and additional synthetic metrics. Experimental results demonstrate that LDTR achieves state-of-the-art performance on well-known datasets.
comment: Accepted by CVM 2024 and CVMJ. 16 pages, 14 figures
☆ Annotation-Efficient Polyp Segmentation via Active Learning
Deep learning-based techniques have proven effective in polyp segmentation tasks when provided with sufficient pixel-wise labeled data. However, the high cost of manual annotation has created a bottleneck for model generalization. To minimize annotation costs, we propose a deep active learning framework for annotation-efficient polyp segmentation. In practice, we measure the uncertainty of each sample by examining the similarity between features masked by the prediction map of the polyp and the background area. Since the segmentation model tends to perform weak in samples with indistinguishable features of foreground and background areas, uncertainty sampling facilitates the fitting of under-learning data. Furthermore, clustering image-level features weighted by uncertainty identify samples that are both uncertain and representative. To enhance the selectivity of the active selection strategy, we propose a novel unsupervised feature discrepancy learning mechanism. The selection strategy and feature optimization work in tandem to achieve optimal performance with a limited annotation budget. Extensive experimental results have demonstrated that our proposed method achieved state-of-the-art performance compared to other competitors on both a public dataset and a large-scale in-house dataset.
comment: 2024 IEEE 21th International Symposium on Biomedical Imaging (ISBI)
☆ On the Concept Trustworthiness in Concept Bottleneck Models
Concept Bottleneck Models (CBMs), which break down the reasoning process into the input-to-concept mapping and the concept-to-label prediction, have garnered significant attention due to their remarkable interpretability achieved by the interpretable concept bottleneck. However, despite the transparency of the concept-to-label prediction, the mapping from the input to the intermediate concept remains a black box, giving rise to concerns about the trustworthiness of the learned concepts (i.e., these concepts may be predicted based on spurious cues). The issue of concept untrustworthiness greatly hampers the interpretability of CBMs, thereby hindering their further advancement. To conduct a comprehensive analysis on this issue, in this study we establish a benchmark to assess the trustworthiness of concepts in CBMs. A pioneering metric, referred to as concept trustworthiness score, is proposed to gauge whether the concepts are derived from relevant regions. Additionally, an enhanced CBM is introduced, enabling concept predictions to be made specifically from distinct parts of the feature map, thereby facilitating the exploration of their related regions. Besides, we introduce three modules, namely the cross-layer alignment (CLA) module, the cross-image alignment (CIA) module, and the prediction alignment (PA) module, to further enhance the concept trustworthiness within the elaborated CBM. The experiments on five datasets across ten architectures demonstrate that without using any concept localization annotations during training, our model improves the concept trustworthiness by a large margin, meanwhile achieving superior accuracy to the state-of-the-arts. Our code is available at https://github.com/hqhQAQ/ProtoCBM.
☆ Towards Efficient Information Fusion: Concentric Dual Fusion Attention Based Multiple Instance Learning for Whole Slide Images
In the realm of digital pathology, multi-magnification Multiple Instance Learning (multi-mag MIL) has proven effective in leveraging the hierarchical structure of Whole Slide Images (WSIs) to reduce information loss and redundant data. However, current methods fall short in bridging the domain gap between pretrained models and medical imaging, and often fail to account for spatial relationships across different magnifications. Addressing these challenges, we introduce the Concentric Dual Fusion Attention-MIL (CDFA-MIL) framework,which innovatively combines point-to-area feature-colum attention and point-to-point concentric-row attention using concentric patch. This approach is designed to effectively fuse correlated information, enhancing feature representation and providing stronger correlation guidance for WSI analysis. CDFA-MIL distinguishes itself by offering a robust fusion strategy that leads to superior WSI recognition. Its application has demonstrated exceptional performance, significantly surpassing existing MIL methods in accuracy and F1 scores on prominent datasets like Camelyon16 and TCGA-NSCLC. Specifically, CDFA-MIL achieved an average accuracy and F1-score of 93.7\% and 94.1\% respectively on these datasets, marking a notable advancement over traditional MIL approaches.
comment: 14 pages, 7 figures
☆ $\nabla τ$: Gradient-based and Task-Agnostic machine Unlearning
Machine Unlearning, the process of selectively eliminating the influence of certain data examples used during a model's training, has gained significant attention as a means for practitioners to comply with recent data protection regulations. However, existing unlearning methods face critical drawbacks, including their prohibitively high cost, often associated with a large number of hyperparameters, and the limitation of forgetting only relatively small data portions. This often makes retraining the model from scratch a quicker and more effective solution. In this study, we introduce Gradient-based and Task-Agnostic machine Unlearning ($\nabla \tau$), an optimization framework designed to remove the influence of a subset of training data efficiently. It applies adaptive gradient ascent to the data to be forgotten while using standard gradient descent for the remaining data. $\nabla \tau$ offers multiple benefits over existing approaches. It enables the unlearning of large sections of the training dataset (up to 30%). It is versatile, supporting various unlearning tasks (such as subset forgetting or class removal) and applicable across different domains (images, text, etc.). Importantly, $\nabla \tau$ requires no hyperparameter adjustments, making it a more appealing option than retraining the model from scratch. We evaluate our framework's effectiveness using a set of well-established Membership Inference Attack metrics, demonstrating up to 10% enhancements in performance compared to state-of-the-art methods without compromising the original model's accuracy.
comment: 14 pages, 2 figures
☆ FFT-based Selection and Optimization of Statistics for Robust Recognition of Severely Corrupted Images ICASSP 2024
Improving model robustness in case of corrupted images is among the key challenges to enable robust vision systems on smart devices, such as robotic agents. Particularly, robust test-time performance is imperative for most of the applications. This paper presents a novel approach to improve robustness of any classification model, especially on severely corrupted images. Our method (FROST) employs high-frequency features to detect input image corruption type, and select layer-wise feature normalization statistics. FROST provides the state-of-the-art results for different models and datasets, outperforming competitors on ImageNet-C by up to 37.1% relative gain, improving baseline of 40.9% mCE on severe corruptions.
comment: ICASSP 2024. Copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other
☆ CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains. Existing methods either rely on domain labels to align domain-invariant feature spaces, or disentangle generalizable features from the whole sample, which inevitably lead to the distortion of semantic feature structures and achieve limited generalization. In this work, we make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features. Specifically, we propose a novel Class Free Prompt Learning (CFPL) paradigm for DG FAS, which utilizes two lightweight transformers, namely Content Q-Former (CQF) and Style Q-Former (SQF), to learn the different semantic prompts conditioned on content and style features by using a set of learnable query vectors, respectively. Thus, the generalizable prompt can be learned by two improvements: (1) A Prompt-Text Matched (PTM) supervision is introduced to ensure CQF learns visual representation that is most informative of the content description. (2) A Diversified Style Prompt (DSP) technology is proposed to diversify the learning of style prompts by mixing feature statistics between instance-specific styles. Finally, the learned text features modulate visual features to generalization through the designed Prompt Modulation (PM). Extensive experiments show that the CFPL is effective and outperforms the state-of-the-art methods on several cross-domain datasets.
comment: 11 pages, 4 figures
☆ Neural Network-Based Processing and Reconstruction of Compromised Biophotonic Image Data
The integration of deep learning techniques with biophotonic setups has opened new horizons in bioimaging. A compelling trend in this field involves deliberately compromising certain measurement metrics to engineer better bioimaging tools in terms of cost, speed, and form-factor, followed by compensating for the resulting defects through the utilization of deep learning models trained on a large amount of ideal, superior or alternative data. This strategic approach has found increasing popularity due to its potential to enhance various aspects of biophotonic imaging. One of the primary motivations for employing this strategy is the pursuit of higher temporal resolution or increased imaging speed, critical for capturing fine dynamic biological processes. This approach also offers the prospect of simplifying hardware requirements/complexities, thereby making advanced imaging standards more accessible in terms of cost and/or size. This article provides an in-depth review of the diverse measurement aspects that researchers intentionally impair in their biophotonic setups, including the point spread function, signal-to-noise ratio, sampling density, and pixel resolution. By deliberately compromising these metrics, researchers aim to not only recuperate them through the application of deep learning networks, but also bolster in return other crucial parameters, such as the field-of-view, depth-of-field, and space-bandwidth product. Here, we discuss various biophotonic methods that have successfully employed this strategic approach. These techniques span broad applications and showcase the versatility and effectiveness of deep learning in the context of compromised biophotonic data. Finally, by offering our perspectives on the future possibilities of this rapidly evolving concept, we hope to motivate our readers to explore novel ways of balancing hardware compromises with compensation via AI.
comment: 17 Pages, 4 Figures, 1 Table
☆ Exosense: A Vision-Centric Scene Understanding System For Safe Exoskeleton Navigation
Exoskeletons for daily use by those with mobility impairments are being developed. They will require accurate and robust scene understanding systems. Current research has used vision to identify immediate terrain and geometric obstacles, however these approaches are constrained to detections directly in front of the user and are limited to classifying a finite range of terrain types (e.g., stairs, ramps and level-ground). This paper presents Exosense, a vision-centric scene understanding system which is capable of generating rich, globally-consistent elevation maps, incorporating both semantic and terrain traversability information. It features an elastic Atlas mapping framework associated with a visual SLAM pose graph, embedded with open-vocabulary room labels from a Vision-Language Model (VLM). The device's design includes a wide field-of-view (FoV) fisheye multi-camera system to mitigate the challenges introduced by the exoskeleton walking pattern. We demonstrate the system's robustness to the challenges of typical periodic walking gaits, and its ability to construct accurate semantically-rich maps in indoor settings. Additionally, we showcase its potential for motion planning -- providing a step towards safe navigation for exoskeletons.
comment: 8 pages, 10 figures
☆ A Lightweight Attention-based Deep Network via Multi-Scale Feature Fusion for Multi-View Facial Expression Recognition
Convolutional neural networks (CNNs) and their variations have shown effectiveness in facial expression recognition (FER). However, they face challenges when dealing with high computational complexity and multi-view head poses in real-world scenarios. We introduce a lightweight attentional network incorporating multi-scale feature fusion (LANMSFF) to tackle these issues. For the first challenge, we have carefully designed a lightweight fully convolutional network (FCN). We address the second challenge by presenting two novel components, namely mass attention (MassAtt) and point wise feature selection (PWFS) blocks. The MassAtt block simultaneously generates channel and spatial attention maps to recalibrate feature maps by emphasizing important features while suppressing irrelevant ones. On the other hand, the PWFS block employs a feature selection mechanism that discards less meaningful features prior to the fusion process. This mechanism distinguishes it from previous methods that directly fuse multi-scale features. Our proposed approach achieved results comparable to state-of-the-art methods in terms of parameter counts and robustness to pose variation, with accuracy rates of 90.77% on KDEF, 70.44% on FER-2013, and 86.96% on FERPlus datasets. The code for LANMSFF is available at https://github.com/AE-1129/LANMSFF.
comment: 9 pages, two-column, submitted to journal
☆ SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks CVPR
The remarkable success of Vision Transformers in Artificial Neural Networks (ANNs) has led to a growing interest in incorporating the self-attention mechanism and transformer-based architecture into Spiking Neural Networks (SNNs). While existing methods propose spiking self-attention mechanisms that are compatible with SNNs, they lack reasonable scaling methods, and the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting local features. To address these challenges, we propose a novel spiking self-attention mechanism named Dual Spike Self-Attention (DSSA) with a reasonable scaling method. Based on DSSA, we propose a novel spiking Vision Transformer architecture called SpikingResformer, which combines the ResNet-based multi-stage architecture with our proposed DSSA to improve both performance and energy efficiency while reducing parameters. Experimental results show that SpikingResformer achieves higher accuracy with fewer parameters and lower energy consumption than other spiking Vision Transformer counterparts. Notably, our SpikingResformer-L achieves 79.40% top-1 accuracy on ImageNet with 4 time-steps, which is the state-of-the-art result in the SNN field.
comment: To be published in the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
☆ Impact Assessment of Missing Data in Model Predictions for Earth Observation Applications
Earth observation (EO) applications involving complex and heterogeneous data sources are commonly approached with machine learning models. However, there is a common assumption that data sources will be persistently available. Different situations could affect the availability of EO sources, like noise, clouds, or satellite mission failures. In this work, we assess the impact of missing temporal and static EO sources in trained models across four datasets with classification and regression tasks. We compare the predictive quality of different methods and find that some are naturally more robust to missing data. The Ensemble strategy, in particular, achieves a prediction robustness up to 100%. We evidence that missing scenarios are significantly more challenging in regression than classification tasks. Finally, we find that the optical view is the most critical view when it is missing individually.
comment: Accepted at IEEE International Geoscience and Remote Sensing Symposium 2024
☆ HySim: An Efficient Hybrid Similarity Measure for Patch Matching in Image Inpainting
Inpainting, for filling missing image regions, is a crucial task in various applications, such as medical imaging and remote sensing. Trending data-driven approaches efficiency, for image inpainting, often requires extensive data preprocessing. In this sense, there is still a need for model-driven approaches in case of application constrained with data availability and quality, especially for those related for time series forecasting using image inpainting techniques. This paper proposes an improved modeldriven approach relying on patch-based techniques. Our approach deviates from the standard Sum of Squared Differences (SSD) similarity measure by introducing a Hybrid Similarity (HySim), which combines both strengths of Chebychev and Minkowski distances. This hybridization enhances patch selection, leading to high-quality inpainting results with reduced mismatch errors. Experimental results proved the effectiveness of our approach against other model-driven techniques, such as diffusion or patch-based approaches, showcasing its effectiveness in achieving visually pleasing restorations.
☆ Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
Diffusion models represent a new paradigm in text-to-image generation. Beyond generating high-quality images from text prompts, models such as Stable Diffusion have been successfully extended to the joint generation of semantic segmentation pseudo-masks. However, current extensions primarily rely on extracting attentions linked to prompt words used for image synthesis. This approach limits the generation of segmentation masks derived from word tokens not contained in the text prompt. In this work, we introduce Open-Vocabulary Attention Maps (OVAM)-a training-free method for text-to-image diffusion models that enables the generation of attention maps for any word. In addition, we propose a lightweight optimization process based on OVAM for finding tokens that generate accurate attention maps for an object class with a single annotation. We evaluate these tokens within existing state-of-the-art Stable Diffusion extensions. The best-performing model improves its mIoU from 52.1 to 86.6 for the synthetic images' pseudo-masks, demonstrating that our optimized tokens are an efficient way to improve the performance of existing methods without architectural changes or retraining.
☆ Exploring Green AI for Audio Deepfake Detection
The state-of-the-art audio deepfake detectors leveraging deep neural networks exhibit impressive recognition performance. Nonetheless, this advantage is accompanied by a significant carbon footprint. This is mainly due to the use of high-performance computing with accelerators and high training time. Studies show that average deep NLP model produces around 626k lbs of CO\textsubscript{2} which is equivalent to five times of average US car emission at its lifetime. This is certainly a massive threat to the environment. To tackle this challenge, this study presents a novel framework for audio deepfake detection that can be seamlessly trained using standard CPU resources. Our proposed framework utilizes off-the-shelve self-supervised learning (SSL) based models which are pre-trained and available in public repositories. In contrast to existing methods that fine-tune SSL models and employ additional deep neural networks for downstream tasks, we exploit classical machine learning algorithms such as logistic regression and shallow neural networks using the SSL embeddings extracted using the pre-trained model. Our approach shows competitive results compared to the commonly used high-carbon footprint approaches. In experiments with the ASVspoof 2019 LA dataset, we achieve a 0.90\% equal error rate (EER) with less than 1k trainable model parameters. To encourage further research in this direction and support reproducible results, the Python code will be made publicly accessible following acceptance. Github: https://github.com/sahasubhajit/Speech-Spoofing-
comment: This manuscript is under review in a conference
☆ Enhancing Historical Image Retrieval with Compositional Cues
In analyzing vast amounts of digitally stored historical image data, existing content-based retrieval methods often overlook significant non-semantic information, limiting their effectiveness for flexible exploration across varied themes. To broaden the applicability of image retrieval methods for diverse purposes and uncover more general patterns, we innovatively introduce a crucial factor from computational aesthetics, namely image composition, into this topic. By explicitly integrating composition-related information extracted by CNN into the designed retrieval model, our method considers both the image's composition rules and semantic information. Qualitative and quantitative experiments demonstrate that the image retrieval network guided by composition information outperforms those relying solely on content information, facilitating the identification of images in databases closer to the target image in human perception. Please visit https://github.com/linty5/CCBIR to try our codes.
☆ Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization
Clustering speaker embeddings is crucial in speaker diarization but hasn't received as much focus as other components. Moreover, the robustness of speaker diarization across various datasets hasn't been explored when the development and evaluation data are from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization. Our extensive experiments on two widely used corpora, AMI and DIHARD, reveal the performance trend of speaker diarization in the presence of domain mismatch. We observe that the performance difference between two different domain conditions can be attributed to the role of spectral clustering. In particular, keeping other modules unchanged, we show that differences in optimal tuning parameters as well as speaker count estimation originates due to the mismatch. This study opens several future directions for speaker diarization research.
comment: Manuscript Under Review
☆ Zero123-6D: Zero-shot Novel View Synthesis for RGB Category-level 6D Pose Estimation
Estimating the pose of objects through vision is essential to make robotic platforms interact with the environment. Yet, it presents many challenges, often related to the lack of flexibility and generalizability of state-of-the-art solutions. Diffusion models are a cutting-edge neural architecture transforming 2D and 3D computer vision, outlining remarkable performances in zero-shot novel-view synthesis. Such a use case is particularly intriguing for reconstructing 3D objects. However, localizing objects in unstructured environments is rather unexplored. To this end, this work presents Zero123-6D to demonstrate the utility of Diffusion Model-based novel-view-synthesizers in enhancing RGB 6D pose estimation at category-level by integrating them with feature extraction techniques. The outlined method exploits such a novel view synthesizer to expand a sparse set of RGB-only reference views for the zero-shot 6D pose estimation task. Experiments are quantitatively analyzed on the CO3D dataset, showcasing increased performance over baselines, a substantial reduction in data requirements, and the removal of the necessity of depth information.
comment: 6 pages, 2 reference pages, 4 figures
☆ Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
☆ A Framework for Portrait Stylization with Skin-Tone Awareness and Nudity Identification ICASSP 2024
Portrait stylization is a challenging task involving the transformation of an input portrait image into a specific style while preserving its inherent characteristics. The recent introduction of Stable Diffusion (SD) has significantly improved the quality of outcomes in this field. However, a practical stylization framework that can effectively filter harmful input content and preserve the distinct characteristics of an input, such as skin-tone, while maintaining the quality of stylization remains lacking. These challenges have hindered the wide deployment of such a framework. To address these issues, this study proposes a portrait stylization framework that incorporates a nudity content identification module (NCIM) and a skin-tone-aware portrait stylization module (STAPSM). In experiments, NCIM showed good performance in enhancing explicit content filtering, and STAPSM accurately represented a diverse range of skin tones. Our proposed framework has been successfully deployed in practice, and it has effectively satisfied critical requirements of real-world applications.
comment: Accepted to ICASSP 2024
☆ Diffusion Models with Ensembled Structure-Based Anomaly Scoring for Unsupervised Anomaly Detection
Supervised deep learning techniques show promise in medical image analysis. However, they require comprehensive annotated data sets, which poses challenges, particularly for rare diseases. Consequently, unsupervised anomaly detection (UAD) emerges as a viable alternative for pathology segmentation, as only healthy data is required for training. However, recent UAD anomaly scoring functions often focus on intensity only and neglect structural differences, which impedes the segmentation performance. This work investigates the potential of Structural Similarity (SSIM) to bridge this gap. SSIM captures both intensity and structural disparities and can be advantageous over the classical $l1$ error. However, we show that there is more than one optimal kernel size for the SSIM calculation for different pathologies. Therefore, we investigate an adaptive ensembling strategy for various kernel sizes to offer a more pathology-agnostic scoring mechanism. We demonstrate that this ensembling strategy can enhance the performance of DMs and mitigate the sensitivity to different kernel sizes across varying pathologies, highlighting its promise for brain MRI anomaly detection.
comment: Accepted at IEEE ISBI 2024
☆ LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding LREC
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained significant attention due to their importance. Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure. However, these methods require fine-tuning for each task and dataset, and the models are expensive to train and operate. To overcome this limitation, we propose a new LayoutLLM that integrates these with large-scale language models (LLMs). By leveraging the strengths of existing research in document image understanding and LLMs' superior language understanding capabilities, the proposed model, fine-tuned with multimodal instruction datasets, performs an understanding of document images in a single model. Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
comment: LREC-COLING 2024
☆ Safeguarding Medical Image Segmentation Datasets against Unauthorized Training via Contour- and Texture-Aware Perturbations
The widespread availability of publicly accessible medical images has significantly propelled advancements in various research and clinical fields. Nonetheless, concerns regarding unauthorized training of AI systems for commercial purposes and the duties of patient privacy protection have led numerous institutions to hesitate to share their images. This is particularly true for medical image segmentation (MIS) datasets, where the processes of collection and fine-grained annotation are time-intensive and laborious. Recently, Unlearnable Examples (UEs) methods have shown the potential to protect images by adding invisible shortcuts. These shortcuts can prevent unauthorized deep neural networks from generalizing. However, existing UEs are designed for natural image classification and fail to protect MIS datasets imperceptibly as their protective perturbations are less learnable than important prior knowledge in MIS, e.g., contour and texture features. To this end, we propose an Unlearnable Medical image generation method, termed UMed. UMed integrates the prior knowledge of MIS by injecting contour- and texture-aware perturbations to protect images. Given that our target is to only poison features critical to MIS, UMed requires only minimal perturbations within the ROI and its contour to achieve greater imperceptibility (average PSNR is 50.03) and protective performance (clean average DSC degrades from 82.18% to 6.80%).
☆ ResNet101 and DAE for Enhance Quality and Classification Accuracy in Skin Cancer Imaging
Skin cancer is a crucial health issue that requires timely detection for higher survival rates. Traditional computer vision techniques face challenges in addressing the advanced variability of skin lesion features, a gap partially bridged by convolutional neural networks (CNNs). To overcome the existing issues, we introduce an innovative convolutional ensemble network approach named deep autoencoder (DAE) with ResNet101. This method utilizes convolution-based deep neural networks for the detection of skin cancer. The ISIC-2018 public data taken from the source is used for experimental results, which demonstrate remarkable performance with the different in terms of performance metrics. The methods result in 96.03% of accuracy, 95.40 % of precision, 96.05% of recall, 0.9576 of F-measure, 0.98 of AUC.
comment: 6 Pages; 14 figures; 3 tables
☆ Isotropic Gaussian Splatting for Real-Time Radiance Field Rendering
The 3D Gaussian splatting method has drawn a lot of attention, thanks to its high performance in training and high quality of the rendered image. However, it uses anisotropic Gaussian kernels to represent the scene. Although such anisotropic kernels have advantages in representing the geometry, they lead to difficulties in terms of computation, such as splitting or merging two kernels. In this paper, we propose to use isotropic Gaussian kernels to avoid such difficulties in the computation, leading to a higher performance method. The experiments confirm that the proposed method is about {\bf 100X} faster without losing the geometry representation accuracy. The proposed method can be applied in a large range applications where the radiance field is needed, such as 3D reconstruction, view synthesis, and dynamic object modeling.
☆ Dermacen Analytica: A Novel Methodology Integrating Multi-Modal Large Language Models with Machine Learning in tele-dermatology
The rise of Artificial Intelligence creates great promise in the field of medical discovery, diagnostics and patient management. However, the vast complexity of all medical domains require a more complex approach that combines machine learning algorithms, classifiers, segmentation algorithms and, lately, large language models. In this paper, we describe, implement and assess an Artificial Intelligence-empowered system and methodology aimed at assisting the diagnosis process of skin lesions and other skin conditions within the field of dermatology that aims to holistically address the diagnostic process in this domain. The workflow integrates large language, transformer-based vision models and sophisticated machine learning tools. This holistic approach achieves a nuanced interpretation of dermatological conditions that simulates and facilitates a dermatologist's workflow. We assess our proposed methodology through a thorough cross-model validation technique embedded in an evaluation pipeline that utilizes publicly available medical case studies of skin conditions and relevant images. To quantitatively score the system performance, advanced machine learning and natural language processing tools are employed which focus on similarity comparison and natural language inference. Additionally, we incorporate a human expert evaluation process based on a structured checklist to further validate our results. We implemented the proposed methodology in a system which achieved approximate (weighted) scores of 0.87 for both contextual understanding and diagnostic accuracy, demonstrating the efficacy of our approach in enhancing dermatological analysis. The proposed methodology is expected to prove useful in the development of next-generation tele-dermatology applications, enhancing remote consultation capabilities and access to care, especially in underserved areas.
☆ Weak Supervision with Arbitrary Single Frame for Micro- and Macro-expression Spotting
Frame-level micro- and macro-expression spotting methods require time-consuming frame-by-frame observation during annotation. Meanwhile, video-level spotting lacks sufficient information about the location and number of expressions during training, resulting in significantly inferior performance compared with fully-supervised spotting. To bridge this gap, we propose a point-level weakly-supervised expression spotting (PWES) framework, where each expression requires to be annotated with only one random frame (i.e., a point). To mitigate the issue of sparse label distribution, the prevailing solution is pseudo-label mining, which, however, introduces new problems: localizing contextual background snippets results in inaccurate boundaries and discarding foreground snippets leads to fragmentary predictions. Therefore, we design the strategies of multi-refined pseudo label generation (MPLG) and distribution-guided feature contrastive learning (DFCL) to address these problems. Specifically, MPLG generates more reliable pseudo labels by merging class-specific probabilities, attention scores, fused features, and point-level labels. DFCL is utilized to enhance feature similarity for the same categories and feature variability for different categories while capturing global representations across the entire datasets. Extensive experiments on the CAS(ME)^2, CAS(ME)^3, and SAMM-LV datasets demonstrate PWES achieves promising performance comparable to that of recent fully-supervised methods.
☆ RG-CAT: Detection Pipeline and Catalogue of Radio Galaxies in the EMU Pilot Survey
We present source detection and catalogue construction pipelines to build the first catalogue of radio galaxies from the 270 $\rm deg^2$ pilot survey of the Evolutionary Map of the Universe (EMU-PS) conducted with the Australian Square Kilometre Array Pathfinder (ASKAP) telescope. The detection pipeline uses Gal-DINO computer-vision networks (Gupta et al., 2024) to predict the categories of radio morphology and bounding boxes for radio sources, as well as their potential infrared host positions. The Gal-DINO network is trained and evaluated on approximately 5,000 visually inspected radio galaxies and their infrared hosts, encompassing both compact and extended radio morphologies. We find that the Intersection over Union (IoU) for the predicted and ground truth bounding boxes is larger than 0.5 for 99% of the radio sources, and 98% of predicted host positions are within $3^{\prime \prime}$ of the ground truth infrared host in the evaluation set. The catalogue construction pipeline uses the predictions of the trained network on the radio and infrared image cutouts based on the catalogue of radio components identified using the Selavy source finder algorithm. Confidence scores of the predictions are then used to prioritize Selavy components with higher scores and incorporate them first into the catalogue. This results in identifications for a total of 211,625 radio sources, with 201,211 classified as compact and unresolved. The remaining 10,414 are categorized as extended radio morphologies, including 582 FR-I, 5,602 FR-II, 1,494 FR-x (uncertain whether FR-I or FR-II), 2,375 R (single-peak resolved) radio galaxies, and 361 with peculiar and other rare morphologies. We cross-match the radio sources in the catalogue with the infrared and optical catalogues, finding infrared cross-matches for 73% and photometric redshifts for 36% of the radio galaxies.
comment: Accepted for publication in PASA. The paper has 22 pages, 12 figures and 5 tables
☆ SoftPatch: Unsupervised Anomaly Detection with Noisy Data
Although mainstream unsupervised anomaly detection (AD) algorithms perform well in academic datasets, their performance is limited in practical application due to the ideal experimental setting of clean training data. Training with noisy data is an inevitable problem in real-world anomaly detection but is seldom discussed. This paper considers label-level noise in image sensory anomaly detection for the first time. To solve this problem, we proposed a memory-based unsupervised AD method, SoftPatch, which efficiently denoises the data at the patch level. Noise discriminators are utilized to generate outlier scores for patch-level noise elimination before coreset construction. The scores are then stored in the memory bank to soften the anomaly detection boundary. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in coreset. Comprehensive experiments in various noise scenes demonstrate that SoftPatch outperforms the state-of-the-art AD methods on the MVTecAD and BTAD benchmarks and is comparable to those methods under the setting without noise.
comment: 36th Conference on Neural Information Processing Systems
☆ Toward Multi-class Anomaly Detection: Exploring Class-aware Unified Model against Inter-class Interference
In the context of high usability in single-class anomaly detection models, recent academic research has become concerned about the more complex multi-class anomaly detection. Although several papers have designed unified models for this task, they often overlook the utility of class labels, a potent tool for mitigating inter-class interference. To address this issue, we introduce a Multi-class Implicit Neural representation Transformer for unified Anomaly Detection (MINT-AD), which leverages the fine-grained category information in the training stage. By learning the multi-class distributions, the model generates class-aware query embeddings for the transformer decoder, mitigating inter-class interference within the reconstruction model. Utilizing such an implicit neural representation network, MINT-AD can project category and position information into a feature embedding space, further supervised by classification and prior probability loss functions. Experimental results on multiple datasets demonstrate that MINT-AD outperforms existing unified training models.
☆ Unsupervised Audio-Visual Segmentation with Modality Alignment
Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound. Current AVS methods rely on costly fine-grained annotations of mask-audio pairs, making them impractical for scalability. To address this, we introduce unsupervised AVS, eliminating the need for such expensive annotation. To tackle this more challenging problem, we propose an unsupervised learning method, named Modality Correspondence Alignment (MoCA), which seamlessly integrates off-the-shelf foundation models like DINO, SAM, and ImageBind. This approach leverages their knowledge complementarity and optimizes their joint usage for multi-modality association. Initially, we estimate positive and negative image pairs in the feature space. For pixel-level association, we introduce an audio-visual adapter and a novel pixel matching aggregation strategy within the image-level contrastive learning framework. This allows for a flexible connection between object appearance and audio signal at the pixel level, with tolerance to imaging variations such as translation and rotation. Extensive experiments on the AVSBench (single and multi-object splits) and AVSS datasets demonstrate that our MoCA outperforms strongly designed baseline methods and approaches supervised counterparts, particularly in complex scenarios with multiple auditory objects. Notably when comparing mIoU, MoCA achieves a substantial improvement over baselines in both the AVSBench (S4: +17.24%; MS3: +67.64%) and AVSS (+19.23%) audio-visual segmentation challenges.
☆ Debiasing surgeon: fantastic weights and how to find them
Nowadays an ever-growing concerning phenomenon, the emergence of algorithmic biases that can lead to unfair models, emerges. Several debiasing approaches have been proposed in the realm of deep learning, employing more or less sophisticated approaches to discourage these models from massively employing these biases. However, a question emerges: is this extra complexity really necessary? Is a vanilla-trained model already embodying some ``unbiased sub-networks'' that can be used in isolation and propose a solution without relying on the algorithmic biases? In this work, we show that such a sub-network typically exists, and can be extracted from a vanilla-trained model without requiring additional training. We further validate that such specific architecture is incapable of learning a specific bias, suggesting that there are possible architectural countermeasures to the problem of biases in deep neural networks.
☆ Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization CVPR2024
This paper investigates the effective utilization of unlabeled data for large-area cross-view geo-localization (CVGL), encompassing both unsupervised and semi-supervised settings. Common approaches to CVGL rely on ground-satellite image pairs and employ label-driven supervised training. However, the cost of collecting precise cross-view image pairs hinders the deployment of CVGL in real-life scenarios. Without the pairs, CVGL will be more challenging to handle the significant imaging and spatial gaps between ground and satellite images. To this end, we propose an unsupervised framework including a cross-view projection to guide the model for retrieving initial pseudo-labels and a fast re-ranking mechanism to refine the pseudo-labels by leveraging the fact that ``the perfectly paired ground-satellite image is located in a unique and identical scene". The framework exhibits competitive performance compared with supervised works on three open-source benchmarks. Our code and models will be released on https://github.com/liguopeng0923/UCVGL.
comment: Accepted by CVPR2024
☆ PECI-Net: Bolus segmentation from video fluoroscopic swallowing study images using preprocessing ensemble and cascaded inference
Bolus segmentation is crucial for the automated detection of swallowing disorders in videofluoroscopic swallowing studies (VFSS). However, it is difficult for the model to accurately segment a bolus region in a VFSS image because VFSS images are translucent, have low contrast and unclear region boundaries, and lack color information. To overcome these challenges, we propose PECI-Net, a network architecture for VFSS image analysis that combines two novel techniques: the preprocessing ensemble network (PEN) and the cascaded inference network (CIN). PEN enhances the sharpness and contrast of the VFSS image by combining multiple preprocessing algorithms in a learnable way. CIN reduces ambiguity in bolus segmentation by using context from other regions through cascaded inference. Moreover, CIN prevents undesirable side effects from unreliably segmented regions by referring to the context in an asymmetric way. In experiments, PECI-Net exhibited higher performance than four recently developed baseline models, outperforming TernausNet, the best among the baseline models, by 4.54\% and the widely used UNet by 10.83\%. The results of the ablation studies confirm that CIN and PEN are effective in improving bolus segmentation performance.
comment: 20 pages, 8 figures,
☆ StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN
We propose a method that can generate cinemagraphs automatically from a still landscape image using a pre-trained StyleGAN. Inspired by the success of recent unconditional video generation, we leverage a powerful pre-trained image generator to synthesize high-quality cinemagraphs. Unlike previous approaches that mainly utilize the latent space of a pre-trained StyleGAN, our approach utilizes its deep feature space for both GAN inversion and cinemagraph generation. Specifically, we propose multi-scale deep feature warping (MSDFW), which warps the intermediate features of a pre-trained StyleGAN at different resolutions. By using MSDFW, the generated cinemagraphs are of high resolution and exhibit plausible looping animation. We demonstrate the superiority of our method through user studies and quantitative comparisons with state-of-the-art cinemagraph generation methods and a video generation method that uses a pre-trained StyleGAN.
comment: Project website: https://jeolpyeoni.github.io/stylecinegan_project/
♻ ☆ TD-MPC2: Scalable, Robust World Models for Continuous Control ICLR 2024
TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. Explore videos, models, data, code, and more at https://tdmpc2.com
comment: ICLR 2024. Explore videos, models, data, code, and more at https://tdmpc2.com
♻ ☆ Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model
Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, the Typographic Attack, which disrupts vision-language models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), has also been expected to be a security threat to LVLMs. Firstly, we verify typographic attacks on current well-known commercial and open-source LVLMs and uncover the widespread existence of this threat. Secondly, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The Typographic Dataset not only considers the evaluation of typographic attacks under various multi-modal tasks but also evaluates the effects of typographic attacks, influenced by texts generated with diverse factors. Based on the evaluation results, we investigate the causes why typographic attacks may impact VLMs and LVLMs, leading to three highly insightful discoveries. By the examination of our discoveries and experimental validation in the Typographic Dataset, we reduce the performance degradation from $42.07\%$ to $13.90\%$ when LVLMs confront typographic attacks.
♻ ☆ The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
We present the All-Seeing Project V2: a new model and dataset designed for understanding object relations in images. Specifically, we propose the All-Seeing Model V2 (ASMv2) that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation (ReC) task. Leveraging this unified task, our model excels not only in perceiving and recognizing all objects within the image but also in grasping the intricate relation graph between them, diminishing the relation hallucination often encountered by Multi-modal Large Language Models (MLLMs). To facilitate training and evaluation of MLLMs in relation understanding, we created the first high-quality ReC dataset ({AS-V2) which is aligned with the format of standard instruction tuning data. In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for comprehensively evaluating the relation comprehension capabilities of MLLMs. Notably, our ASMv2 achieves an overall accuracy of 52.04 on this relation-aware benchmark, surpassing the 43.14 of LLaVA-1.5 by a large margin. We hope that our work can inspire more future research and contribute to the evolution towards artificial general intelligence. Our project is released at https://github.com/OpenGVLab/all-seeing.
comment: Technical Report
♻ ☆ m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and Github (https://github.com/RAIVNLab/mnms).
♻ ☆ MedCycle: Unpaired Medical Report Generation via Cycle-Consistency
Generating medical reports for X-ray images presents a significant challenge, particularly in unpaired scenarios where access to paired image-report data for training is unavailable. Previous works have typically learned a joint embedding space for images and reports, necessitating a specific labeling schema for both. We introduce an innovative approach that eliminates the need for consistent labeling schemas, thereby enhancing data accessibility and enabling the use of incompatible datasets. This approach is based on cycle-consistent mapping functions that transform image embeddings into report embeddings, coupled with report auto-encoding for medical report generation. Our model and objectives consider intricate local details and the overarching semantic context within images and reports. This approach facilitates the learning of effective mapping functions, resulting in the generation of coherent reports. It outperforms state-of-the-art results in unpaired chest X-ray report generation, demonstrating improvements in both language and clinical metrics.
♻ ☆ A Geospatial Approach to Predicting Desert Locust Breeding Grounds in Africa
Desert locust swarms present a major threat to agriculture and food security. Addressing this challenge, our study develops an operationally-ready model for predicting locust breeding grounds, which has the potential to enhance early warning systems and targeted control measures. We curated a dataset from the United Nations Food and Agriculture Organization's (UN-FAO) locust observation records and analyzed it using two types of spatio-temporal input features: remotely-sensed environmental and climate data as well as multi-spectral earth observation images. Our approach employed custom deep learning models (three-dimensional and LSTM-based recurrent convolutional networks), along with the geospatial foundational model Prithvi recently released by Jakubik et al., 2023. These models notably outperformed existing baselines, with the Prithvi-based model, fine-tuned on multi-spectral images from NASA's Harmonized Landsat and Sentinel-2 (HLS) dataset, achieving the highest accuracy, F1 and ROC-AUC scores (83.03%, 81.53% and 87.69%, respectively). A significant finding from our research is that multi-spectral earth observation images alone are sufficient for effective locust breeding ground prediction without the need to explicitly incorporate climatic or environmental features.
♻ ☆ Towards Flexible, Scalable, and Adaptive Multi-Modal Conditioned Face Synthesis
Recent progress in multi-modal conditioned face synthesis has enabled the creation of visually striking and accurately aligned facial images. Yet, current methods still face issues with scalability, limited flexibility, and a one-size-fits-all approach to control strength, not accounting for the differing levels of conditional entropy, a measure of unpredictability in data given some condition, across modalities. To address these challenges, we introduce a novel uni-modal training approach with modal surrogates, coupled with an entropy-aware modal-adaptive modulation, to support flexible, scalable, and scalable multi-modal conditioned face synthesis network. Our uni-modal training with modal surrogate that only leverage uni-modal data, use modal surrogate to decorate condition with modal-specific characteristic and serve as linker for inter-modal collaboration , fully learns each modality control in face synthesis process as well as inter-modal collaboration. The entropy-aware modal-adaptive modulation finely adjust diffusion noise according to modal-specific characteristics and given conditions, enabling well-informed step along denoising trajectory and ultimately leading to synthesis results of high fidelity and quality. Our framework improves multi-modal face synthesis under various conditions, surpassing current methods in image quality and fidelity, as demonstrated by our thorough experimental results.
♻ ☆ MedMamba: Vision Mamba for Medical Image Classification
Medical image classification is a very fundamental and crucial task in the field of computer vision. These years, CNN-based and Transformer-based models have been widely used to classify various medical images. Unfortunately, The limitation of CNNs in long-range modeling capabilities prevents them from effectively extracting features in medical images, while Transformers are hampered by their quadratic computational complexity. Recent research has shown that the state space model (SSM) represented by Mamba can efficiently model long-range interactions while maintaining linear computational complexity. Inspired by this, we propose Vision Mamba for medical image classification (MedMamba). More specifically, we introduce a novel Conv-SSM module. Conv-SSM combines the local feature extraction ability of convolutional layers with the ability of SSM to capture long-range dependency, thereby modeling medical images with different modalities. To demonstrate the potential of MedMamba, we conducted extensive experiments using 14 publicly available medical datasets with different imaging techniques and two private datasets built by ourselves. Extensive experimental results demonstrate that the proposed MedMamba performs well in detecting lesions in various medical images. To the best of our knowledge, this is the first Vision Mamba tailored for medical image classification. The purpose of this work is to establish a new baseline for medical image classification tasks and provide valuable insights for the future development of more efficient and effective SSM-based artificial intelligence algorithms and application systems in the medical. Source code has been available at https://github.com/YubiaoYue/MedMamba.
♻ ☆ Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation
As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to navigate to a specified object depicted by a goal image in an unexplored environment. The main challenge of this task lies in identifying the target object from different viewpoints while rejecting similar distractors. Existing ImageGoal Navigation methods usually adopt the simple Exploration-Exploitation framework and ignore the identification of specific instance during navigation. In this work, we propose to imitate the human behaviour of ``getting closer to confirm" when distinguishing objects from a distance. Specifically, we design a new modular navigation framework named Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level image goal navigation. Our method allows for active switching among the exploration, verification, and exploitation actions, thereby facilitating the agent in making reasonable decisions under different situations. On the challenging HabitatMatterport 3D semantic (HM3D-SEM) dataset, our method surpasses previous state-of-the-art work, with a classical segmentation model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 0.561 success). Our code will be made publicly available at https://github.com/XiaohanLei/IEVE.
♻ ☆ Generalizing deep learning models for medical image classification
Numerous Deep Learning (DL) models have been developed for a large spectrum of medical image analysis applications, which promises to reshape various facets of medical practice. Despite early advances in DL model validation and implementation, which encourage healthcare institutions to adopt them, some fundamental questions remain: are the DL models capable of generalizing? What causes a drop in DL model performances? How to overcome the DL model performance drop? Medical data are dynamic and prone to domain shift, due to multiple factors such as updates to medical equipment, new imaging workflow, and shifts in patient demographics or populations can induce this drift over time. In this paper, we review recent developments in generalization methods for DL-based classification models. We also discuss future challenges, including the need for improved evaluation protocols and benchmarks, and envisioned future developments to achieve robust, generalized models for medical image classification.
♻ ☆ Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications. However, it is challenging for the visual encoder in Large Vision-Language Models (LVLMs) to extract useful features tailored to questions that aid the language model's response. Furthermore, a common practice among existing LVLMs is to utilize lower-resolution images, which restricts the ability for visual recognition. Our work introduces the Chain-of-Spot (CoS) method, which we describe as Interactive Reasoning, a novel approach that enhances feature extraction by focusing on key regions of interest (ROI) within the image, corresponding to the posed questions or instructions. This technique allows LVLMs to access more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features. By integrating Chain-of-Spot with instruct-following LLaVA-1.5 models, the process of image reasoning consistently improves performance across a wide range of multimodal datasets and benchmarks without bells and whistles and achieves new state-of-the-art results. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content, paving the way for more sophisticated visual instruction-following applications. Code and models are available at https://github.com/dongyh20/Chain-of-Spot
comment: Project Page: https://sites.google.com/view/chain-of-spot/
♻ ☆ Neural Radiance Fields in Medical Imaging: Challenges and Next Steps
Neural Radiance Fields (NeRF), as a pioneering technique in computer vision, offer great potential to revolutionize medical imaging by synthesizing three-dimensional representations from the projected two-dimensional image data. However, they face unique challenges when applied to medical applications. This paper presents a comprehensive examination of applications of NeRFs in medical imaging, highlighting four imminent challenges, including fundamental imaging principles, inner structure requirement, object boundary definition, and color density significance. We discuss current methods on different organs and discuss related limitations. We also review several datasets and evaluation metrics and propose several promising directions for future research.
♻ ☆ Learning a Depth Covariance Function CVPR 2023
We propose learning a depth covariance function with applications to geometric vision tasks. Given RGB images as input, the covariance function can be flexibly used to define priors over depth functions, predictive distributions given observations, and methods for active point selection. We leverage these techniques for a selection of downstream tasks: depth completion, bundle adjustment, and monocular dense visual odometry.
comment: CVPR 2023. Project page: https://edexheim.github.io/DepthCov/
♻ ☆ T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning
The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. Consequently, scholars have been actively investigating efficacious self-supervised pre-training paradigms. Nevertheless, temporal information, which is inherent in the LiDAR point cloud sequence, is consistently disregarded. To better utilize this property, we propose an effective pre-training strategy, namely Temporal Masked Auto-Encoders (T-MAE), which takes as input temporally adjacent frames and learns temporal dependency. A SiamWCA backbone, containing a Siamese encoder and a windowed cross-attention (WCA) module, is established for the two-frame input. Considering that the movement of an ego-vehicle alters the view of the same instance, temporal modeling also serves as a robust and natural data augmentation, enhancing the comprehension of target objects. SiamWCA is a powerful architecture but heavily relies on annotated data. Our T-MAE pre-training strategy alleviates its demand for annotated data. Comprehensive experiments demonstrate that T-MAE achieves the best performance on both Waymo and ONCE datasets among competitive self-supervised approaches.
comment: Under review
♻ ☆ Ins-HOI: Instance Aware Human-Object Interactions Recovery
Accurately modeling detailed interactions between human/hand and object is an appealing yet challenging task. Current multi-view capture systems are only capable of reconstructing multiple subjects into a single, unified mesh, which fails to model the states of each instance individually during interactions. To address this, previous methods use template-based representations to track human/hand and object. However, the quality of the reconstructions is limited by the descriptive capabilities of the templates so that these methods are inherently struggle with geometry details, pressing deformations and invisible contact surfaces. In this work, we propose an end-to-end Instance-aware Human-Object Interactions recovery (Ins-HOI) framework by introducing an instance-level occupancy field representation. However, the real-captured data is presented as a holistic mesh, unable to provide instance-level supervision. To address this, we further propose a complementary training strategy that leverages synthetic data to introduce instance-level shape priors, enabling the disentanglement of occupancy fields for different instances. Specifically, synthetic data, created by randomly combining individual scans of humans/hands and objects, guides the network to learn a coarse prior of instances. Meanwhile, real-captured data helps in learning the overall geometry and restricting interpenetration in contact areas. As demonstrated in experiments, our method Ins-HOI supports instance-level reconstruction and provides reasonable and realistic invisible contact surfaces even in cases of extremely close interaction. To facilitate the research of this task, we collect a large-scale, high-fidelity 3D scan dataset, including 5.2k high-quality scans with real-world human-chair and hand-object interactions. The code and data will be public for research purposes.
comment: Project Page: https://jiajunzhang16.github.io/ins-hoi/ , Code and Dataset Page: https://github.com/jiajunzhang16/ins-hoi
♻ ☆ GIVT: Generative Infinite-Vocabulary Transformers
We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a $\beta$-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework
comment: v2: add related NLP work, loss details. v3: Improved GMM formulation, added adapter module, larger models, better image generation results. Code and model checkpoints are available at: https://github.com/google-research/big_vision
♻ ☆ Closing the Gap: Achieving Better Accuracy-Robustness Tradeoffs against Query-Based Attacks AAAI
Although promising, existing defenses against query-based attacks share a common limitation: they offer increased robustness against attacks at the price of a considerable accuracy drop on clean samples. In this work, we show how to efficiently establish, at test-time, a solid tradeoff between robustness and accuracy when mitigating query-based attacks. Given that these attacks necessarily explore low-confidence regions, our insight is that activating dedicated defenses, such as random noise defense and random image transformations, only for low-confidence inputs is sufficient to prevent them. Our approach is independent of training and supported by theory. We verify the effectiveness of our approach for various existing defenses by conducting extensive experiments on CIFAR-10, CIFAR-100, and ImageNet. Our results confirm that our proposal can indeed enhance these defenses by providing better tradeoffs between robustness and accuracy when compared to state-of-the-art approaches while being completely training-free.
comment: To appear in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2024
♻ ☆ Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations CVPR2024
The perception of motion behavior in a dynamic environment holds significant importance for autonomous driving systems, wherein class-agnostic motion prediction methods directly predict the motion of the entire point cloud. While most existing methods rely on fully-supervised learning, the manual labeling of point cloud data is laborious and time-consuming. Therefore, several annotation-efficient methods have been proposed to address this challenge. Although effective, these methods rely on weak annotations or additional multi-modal data like images, and the potential benefits inherent in the point cloud sequence are still underexplored. To this end, we explore the feasibility of self-supervised motion prediction with only unlabeled LiDAR point clouds. Initially, we employ an optimal transport solver to establish coarse correspondences between current and future point clouds as the coarse pseudo motion labels. Training models directly using such coarse labels leads to noticeable spatial and temporal prediction inconsistencies. To mitigate these issues, we introduce three simple spatial and temporal regularization losses, which facilitate the self-supervised training process effectively. Experimental results demonstrate the significant superiority of our approach over the state-of-the-art self-supervised methods.
comment: Accepted by CVPR2024
♻ ☆ ColonNeRF: High-Fidelity Neural Reconstruction of Long Colonoscopy
Colonoscopy reconstruction is pivotal for diagnosing colorectal cancer. However, accurate long-sequence colonoscopy reconstruction faces three major challenges: (1) dissimilarity among segments of the colon due to its meandering and convoluted shape; (2) co-existence of simple and intricately folded geometry structures; (3) sparse viewpoints due to constrained camera trajectories. To tackle these challenges, we introduce a new reconstruction framework based on neural radiance field (NeRF), named ColonNeRF, which leverages neural rendering for novel view synthesis of long-sequence colonoscopy. Specifically, to reconstruct the entire colon in a piecewise manner, our ColonNeRF introduces a region division and integration module, effectively reducing shape dissimilarity and ensuring geometric consistency in each segment. To learn both the simple and complex geometry in a unified framework, our ColonNeRF incorporates a multi-level fusion module that progressively models the colon regions from easy to hard. Additionally, to overcome the challenges from sparse views, we devise a DensiNet module for densifying camera poses under the guidance of semantic consistency. We conduct extensive experiments on both synthetic and real-world datasets to evaluate our ColonNeRF. Quantitatively, ColonNeRF exhibits a 67%-85% increase in LPIPS-ALEX scores. Qualitatively, our reconstruction visualizations show much clearer textures and more accurate geometric details. These sufficiently demonstrate our superior performance over the state-of-the-art methods.
comment: for Project Page, see https://showlab.github.io/ColonNeRF/
♻ ☆ Neuromorphic Imaging and Classification with Graph Learning
Bio-inspired neuromorphic cameras asynchronously record pixel brightness changes and generate sparse event streams. They can capture dynamic scenes with little motion blur and more details in extreme illumination conditions. Due to the multidimensional address-event structure, most existing vision algorithms cannot properly handle asynchronous event streams. While several event representations and processing methods have been developed to address such an issue, they are typically driven by a large number of events, leading to substantial overheads in runtime and memory. In this paper, we propose a new graph representation of the event data and couple it with a Graph Transformer to perform accurate neuromorphic classification. Extensive experiments show that our approach leads to better results and excels at the challenging realistic situations where only a small number of events and limited computational resources are available, paving the way for neuromorphic applications embedded into mobile facilities.
comment: 15 pages, 4 figures, and 7 tables. Accepted by Elsevier Neurocomputing
♻ ☆ BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image CVPR 2024
Creating personalized hand avatars is important to offer a realistic experience to users on AR / VR platforms. While most prior studies focused on reconstructing 3D hand shapes, some recent work has tackled the reconstruction of hand textures on top of shapes. However, these methods are often limited to capturing pixels on the visible side of a hand, requiring diverse views of the hand in a video or multiple images as input. In this paper, we propose a novel method, BiTT(Bi-directional Texture reconstruction of Two hands), which is the first end-to-end trainable method for relightable, pose-free texture reconstruction of two interacting hands taking only a single RGB image, by three novel components: 1) bi-directional (left $\leftrightarrow$ right) texture reconstruction using the texture symmetry of left / right hands, 2) utilizing a texture parametric model for hand texture recovery, and 3) the overall coarse-to-fine stage pipeline for reconstructing personalized texture of two interacting hands. BiTT first estimates the scene light condition and albedo image from an input image, then reconstructs the texture of both hands through the texture parametric model and bi-directional texture reconstructor. In experiments using InterHand2.6M and RGB2Hands datasets, our method significantly outperforms state-of-the-art hand texture reconstruction methods quantitatively and qualitatively. The code is available at https://github.com/yunminjin2/BiTT
comment: Accepted by CVPR 2024
♻ ☆ An explainable three dimension framework to uncover learning patterns: A unified look in variable sulci recognition
Explainable AI is crucial in medical imaging. In the challenging field of neuroscience, visual topics present a high level of complexity, particularly within three-dimensional space. The application of neuroscience, which involves identifying brain sulcal features from MRI, faces significant hurdles due to varying annotation protocols among experts and the intricate three-dimension functionality of the brain. Consequently, traditional explainability approaches fall short in effectively validating and evaluating these networks. To address this, we first present a mathematical formulation delineating various categories of explanation needs across diverse computer vision tasks, categorized into self-explanatory, semi-explanatory, non-explanatory, and new-pattern learning applications based on the reliability of the validation protocol. With respect to this mathematical formulation, we propose a 3D explainability framework aimed at validating the outputs of deep learning networks in detecting the paracingulate sulcus an essential brain anatomical feature. The framework integrates local 3D explanations, global explanations through dimensionality reduction, concatenated global explanations, and statistical shape features, unveiling new insights into pattern learning. We trained and tested two advanced 3D deep learning networks on the challenging TOP-OSLO dataset, significantly improving sulcus detection accuracy, particularly on the left hemisphere. During evaluation with diverse annotation protocols for this dataset, we highlighted the crucial role of an unbiased annotation process in achieving precise predictions and effective pattern learning within our proposed 3D framework. The proposed framework not only annotates the variable sulcus but also uncovers hidden AI knowledge, promising to advance our understanding of brain anatomy and function.
♻ ☆ A Generative Approach for Wikipedia-Scale Visual Entity Recognition CVPR2024
In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is using dual-encoder models (eg CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, it is also possible to re-purpose a captioning model to directly generate the entity names for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER) framework, which given an input image learns to auto-regressively decode a semantic and discriminative ``code'' identifying the target entity. Our experiments demonstrate the efficacy of this GER paradigm, showcasing state-of-the-art performance on the challenging OVEN benchmark. GER surpasses strong captioning, dual-encoder, visual matching and hierarchical classification baselines, affirming its advantage in tackling the complexities of web-scale recognition.
comment: CVPR2024
♻ ☆ Analyzing Local Representations of Self-supervised Vision Transformers
In this paper, we present a comparative analysis of various self-supervised Vision Transformers (ViTs), focusing on their local representative power. Inspired by large language models, we examine the abilities of ViTs to perform various computer vision tasks with little to no fine-tuning. We design evaluation framework to analyze the quality of local, i.e.\ patch-level, representations in the context of few-shot semantic segmentation, instance identification, object retrieval and tracking. We discover that contrastive learning based methods like DINO produce more universal patch representations that can be immediately applied for downstream tasks with no parameter tuning, compared to masked image modeling. The embeddings learned using the latter approach, e.g. in masked autoencoders, have high variance features that harm distance-based algorithms, such as k-NN, and do not contain useful information for most downstream tasks. Furthermore, we demonstrate that removing these high-variance features enhances k-NN for MAE, as well as for its recent extension Scale-MAE. Finally, we find an object instance retrieval setting where DINOv2, a model pretrained on two orders of magnitude more data, falls short of its less compute intensive counterpart DINO.
♻ ☆ Enhanced Few-Shot Class-Incremental Learning via Ensemble Models
Few-shot class-incremental learning (FSCIL) aims to continually fit new classes with limited training data, while maintaining the performance of previously learned classes. The main challenges are overfitting the rare new training samples and forgetting old classes. While catastrophic forgetting has been extensively studied, the overfitting problem has attracted less attention in FSCIL. To tackle overfitting challenge, we design a new ensemble model framework cooperated with data augmentation to boost generalization. In this way, the enhanced model works as a library storing abundant features to guarantee fast adaptation to downstream tasks. Specifically, the multi-input multi-output ensemble structure is applied with a spatial-aware data augmentation strategy, aiming at diversifying the feature extractor and alleviating overfitting in incremental sessions. Moreover, self-supervised learning is also integrated to further improve the model generalization. Comprehensive experimental results show that the proposed method can indeed mitigate the overfitting problem in FSCIL, and outperform the state-of-the-art methods.
♻ ☆ Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation CVPR 2024
Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve segmentation tasks without dense annotations. However, attributed to the frequent coupling of co-occurring objects and the limited supervision from image-level labels, the challenging co-occurrence problem is widely present and leads to false activation of objects in WSSS. In this work, we devise a 'Separate and Conquer' scheme SeCo to tackle this issue from dimensions of image space and feature space. In the image space, we propose to 'separate' the co-occurring objects with image decomposition by subdividing images into patches. Importantly, we assign each patch a category tag from Class Activation Maps (CAMs), which spatially helps remove the co-context bias and guide the subsequent representation. In the feature space, we propose to 'conquer' the false activation by enhancing semantic representation with multi-granularity knowledge contrast. To this end, a dual-teacher-single-student architecture is designed and tag-guided contrast is conducted, which guarantee the correctness of knowledge and further facilitate the discrepancy among co-contexts. We streamline the multi-staged WSSS pipeline end-to-end and tackle this issue without external supervision. Extensive experiments are conducted, validating the efficiency of our method and the superiority over previous single-staged and even multi-staged competitors on PASCAL VOC and MS COCO. Code is available at https://github.com/zwyang6/SeCo.git.
comment: Accepted by CVPR 2024
♻ ☆ Visually-Aware Context Modeling for News Image Captioning NAACL 2024
News Image Captioning aims to create captions from news articles and images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence pattern in existing datasets, we propose a face-naming module for learning better name embeddings. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. We design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image, mimicking human thought process of linking articles to images. Furthermore, to tackle the problem of the imbalanced proportion of article context and image context in captions, we introduce a simple yet effective method Contrasting with Language Model backbone (CoLaM) to the training pipeline. We conduct extensive experiments to demonstrate the efficacy of our framework. We out-perform the previous state-of-the-art (without external data) by 7.97/5.80 CIDEr scores on GoodNews/NYTimes800k. Our code is available at https://github.com/tingyu215/VACNIC.
comment: Accepted at NAACL 2024 Main Conference
♻ ☆ An Active Contour Model Driven By the Hybrid Signed Pressure Function
Due to the influence of imaging equipment and complex imaging environments, most images in daily life have features of intensity inhomogeneity and noise. Therefore, many scholars have designed many image segmentation algorithms to address these issues. Among them, the active contour model is one of the most effective image segmentation algorithms.This paper proposes an active contour model driven by the hybrid signed pressure function that combines global and local information construction. Firstly, a new global region-based signed pressure function is introduced by combining the average intensity of the inner and outer regions of the curve with the median intensity of the inner region of the evolution curve. Then, the paper uses the energy differences between the inner and outer regions of the curve in the local region to design the signed pressure function of the local term. Combine the two SPF function to obtain a new signed pressure function and get the evolution equation of the new model. Finally, experiments and numerical analysis show that the model has excellent segmentation performance for both intensity inhomogeneous images and noisy images.
♻ ☆ Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion CVPR 2024
Recent advances in generative AI have unveiled significant potential for the creation of 3D content. However, current methods either apply a pre-trained 2D diffusion model with the time-consuming score distillation sampling (SDS), or a direct 3D diffusion model trained on limited 3D data losing generation diversity. In this work, we approach the problem by employing a multi-view 2.5D diffusion fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D diffusion directly models the structural distribution of 3D data, while still maintaining the strong generalization ability of the original 2D diffusion model, filling the gap between 2D diffusion-based and direct 3D diffusion-based methods for 3D content generation. During inference, multi-view normal maps are generated using the 2.5D diffusion, and a novel differentiable rasterization scheme is introduced to fuse the almost consistent multi-view normal maps into a consistent 3D model. We further design a normal-conditioned multi-view image generation module for fast appearance generation given the 3D geometry. Our method is a one-pass diffusion process and does not require any SDS optimization as post-processing. We demonstrate through extensive experiments that, our direct 2.5D generation with the specially-designed fusion scheme can achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25.
comment: CVPR 2024 camera ready, including more evaluations and discussions. Project webpage: https://nju-3dv.github.io/projects/direct25
♻ ☆ Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps ICLR2024
Diffusion Probabilistic Models (DPM) have shown remarkable efficacy in the synthesis of high-quality images. However, their inference process characteristically requires numerous, potentially hundreds, of iterative steps, which could exaggerate the problem of exposure bias due to the training and inference discrepancy. Previous work has attempted to mitigate this issue by perturbing inputs during training, which consequently mandates the retraining of the DPM. In this work, we conduct a systematic study of exposure bias in DPM and, intriguingly, we find that the exposure bias could be alleviated with a novel sampling method that we propose, without retraining the model. We empirically and theoretically show that, during inference, for each backward time step $t$ and corresponding state $\hat{x}_t$, there might exist another time step $t_s$ which exhibits superior coupling with $\hat{x}_t$. Based on this finding, we introduce a sampling method named Time-Shift Sampler. Our framework can be seamlessly integrated to existing sampling algorithms, such as DDPM, DDIM and other high-order solvers, inducing merely minimal additional computations. Experimental results show our method brings significant and consistent improvements in FID scores on different datasets and sampling methods. For example, integrating Time-Shift Sampler to F-PNDM yields a FID=3.88, achieving 44.49\% improvements as compared to F-PNDM, on CIFAR-10 with 10 sampling steps, which is more performant than the vanilla DDIM with 100 sampling steps. Our code is available at https://github.com/Mingxiao-Li/TS-DPM.
comment: Accepted at International Conference on Learning Representations (ICLR2024)
♻ ☆ To use or not to use proprietary street view images in (health and place) research? That is the question
Computer vision-based analysis of street view imagery has transformative impacts on environmental assessments. Interactive web services, particularly Google Street View, play an ever-important role in making imagery data ubiquitous. Despite the technical ease of harnessing millions of Google Street View images, this article questions the current practices in using this proprietary data source from a European viewpoint. Our concern lies with Google's terms of service, which restrict bulk image downloads and the generation of street view image-based indices. To reconcile the challenge of advancing society through groundbreaking research while maintaining data license agreements and legal integrity, we believe it is crucial to 1) include an author's statement on using proprietary street view data and the directives it entails, 2) negotiate academic-specific license to democratize Google Street View data access, and 3) adhere to open data principles and utilize open image sources for future research.
♻ ☆ Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training CVPR 2024
In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data, using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results.
comment: Accepted at CVPR 2024. 13 pages, 4 figures
♻ ☆ AI-KD: Adversarial learning and Implicit regularization for self-Knowledge Distillation
We present a novel adversarial penalized self-knowledge distillation method, named adversarial learning and implicit regularization for self-knowledge distillation (AI-KD), which regularizes the training procedure by adversarial learning and implicit distillations. Our model not only distills the deterministic and progressive knowledge which are from the pre-trained and previous epoch predictive probabilities but also transfers the knowledge of the deterministic predictive distributions using adversarial learning. The motivation is that the self-knowledge distillation methods regularize the predictive probabilities with soft targets, but the exact distributions may be hard to predict. Our method deploys a discriminator to distinguish the distributions between the pre-trained and student models while the student model is trained to fool the discriminator in the trained procedure. Thus, the student model not only can learn the pre-trained model's predictive probabilities but also align the distributions between the pre-trained and student models. We demonstrate the effectiveness of the proposed method with network architectures on multiple datasets and show the proposed method achieves better performance than state-of-the-art methods.
comment: Accepted to KBS
♻ ☆ Consistency Enhancement-Based Deep Multiview Clustering via Contrastive Learning
Multiview clustering (MVC) segregates data samples into meaningful clusters by synthesizing information across multiple views. Moreover, deep learning-based methods have demonstrated their strong feature learning capabilities in MVC scenarios. However, effectively generalizing feature representations while maintaining consistency is still an intractable problem. In addition, most existing deep clustering methods based on contrastive learning overlook the consistency of the clustering representations during the clustering process. In this paper, we show how the above problems can be overcome and propose a consistent enhancement-based deep MVC method via contrastive learning (CCEC). Specifically, semantic connection blocks are incorporated into a feature representation to preserve the consistent information among multiple views. Furthermore, the representation process for clustering is enhanced through spectral clustering, and the consistency across multiple views is improved. Experiments conducted on five datasets demonstrate the effectiveness and superiority of our method in comparison with the state-of-the-art (SOTA) methods. The code for this method can be accessed at https://anonymous.4open.science/r/CCEC-E84E/.
comment: There are multiple errors that need to be corrected, including some formulas and concept descriptions. We will re upload the paper after the modifications are completed
♻ ☆ LaserHuman: Language-guided Scene-aware Human Motion Generation in Free Environment
Language-guided scene-aware human motion generation has great significance for entertainment and robotics. In response to the limitations of existing datasets, we introduce LaserHuman, a pioneering dataset engineered to revolutionize Scene-Text-to-Motion research. LaserHuman stands out with its inclusion of genuine human motions within 3D environments, unbounded free-form natural language descriptions, a blend of indoor and outdoor scenarios, and dynamic, ever-changing scenes. Diverse modalities of capture data and rich annotations present great opportunities for the research of conditional motion generation, and can also facilitate the development of real-life applications. Moreover, to generate semantically consistent and physically plausible human motions, we propose a multi-conditional diffusion model, which is simple but effective, achieving state-of-the-art performance on existing datasets.
♻ ☆ LLM4SGG: Large Language Model for Weakly Supervised Scene Graph Generation CVPR 2024
Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard, studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However, they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions, where fine-grained predicates in captions are undesirably converted into coarse-grained predicates, resulting in a long-tailed predicate distribution, and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest, where many triplets are discarded and not used in training, leading to insufficient supervision. To tackle the two issues, we propose a new approach, i.e., Large Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes, we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG, we conduct extensive experiments on Visual Genome and GQA datasets, showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient, enabling effective model training with a small amount of training images.
comment: 8 pages; CVPR 2024
♻ ☆ Intrinsic Image Diffusion for Indoor Single-view Material Estimation
We present Intrinsic Image Diffusion, a generative model for appearance decomposition of indoor scenes. Given a single input view, we sample multiple possible material explanations represented as albedo, roughness, and metallic maps. Appearance decomposition poses a considerable challenge in computer vision due to the inherent ambiguity between lighting and material properties and the lack of real datasets. To address this issue, we advocate for a probabilistic formulation, where instead of attempting to directly predict the true material properties, we employ a conditional generative model to sample from the solution space. Furthermore, we show that utilizing the strong learned prior of recent diffusion models trained on large-scale real-world images can be adapted to material estimation and highly improves the generalization to real images. Our method produces significantly sharper, more consistent, and more detailed materials, outperforming state-of-the-art methods by $1.5dB$ on PSNR and by $45\%$ better FID score on albedo prediction. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.
comment: Project page: https://peter-kocsis.github.io/IntrinsicImageDiffusion/ Video: https://youtu.be/lz0meJlj5cA
♻ ☆ Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision
With the rapidly increasing demand for oriented object detection (OOD), recent research involving weakly-supervised detectors for learning rotated box (RBox) from the horizontal box (HBox) has attracted more and more attention. In this paper, we explore a more challenging yet label-efficient setting, namely single point-supervised OOD, and present our approach called Point2RBox. Specifically, we propose to leverage two principles: 1) Synthetic pattern knowledge combination: By sampling around each labeled point on the image, we spread the object feature to synthetic visual patterns with known boxes to provide the knowledge for box regression. 2) Transform self-supervision: With a transformed input image (e.g. scaled/rotated), the output RBoxes are trained to follow the same transformation so that the network can perceive the relative size/rotation between objects. The detector is further enhanced by a few devised techniques to cope with peripheral issues, e.g. the anchor/layer assignment as the size of the object is not available in our point supervision setting. To our best knowledge, Point2RBox is the first end-to-end solution for point-supervised OOD. In particular, our method uses a lightweight paradigm, yet it achieves a competitive performance among point-supervised alternatives, 41.05%/27.62%/80.01% on DOTA/DIOR/HRSC datasets.
comment: 10 pages, 3 figures, 5 tables, code: https://github.com/yuyi1005/point2rbox-mmrotate
♻ ☆ Driving Animatronic Robot Facial Expression From Speech
Animatronic robots aim to enable natural human-robot interaction through lifelike facial expressions. However, generating realistic, speech-synchronized robot expressions is challenging due to the complexities of facial biomechanics and responsive motion synthesis. This paper presents a principled, skinning-centric approach to drive animatronic robot facial expressions from speech. The proposed approach employs linear blend skinning (LBS) as the core representation to guide tightly integrated innovations in embodiment design and motion synthesis. LBS informs the actuation topology, enables human expression retargeting, and allows speech-driven facial motion generation. The proposed approach is capable of generating highly realistic, real-time facial expressions from speech on an animatronic face, significantly advancing robots' ability to replicate nuanced human expressions for natural interaction.
comment: Under review. For associated project page, see https://library87.github.io/animatronic-face-iros24
♻ ☆ Mpox-AISM: AI-Mediated Super Monitoring for Mpox and Like-Mpox
The key to preventing the spread of mpox (monkeypox) lies in timely, convenient, and accurate diagnosis for earlier-stage infected individuals. Unfortunately, the resemblances between common skin diseases and mpox and the need for professional diagnosis inevitably deteriorated the diagnosis of earlier-stage patients with Mpox and contributed to its widespread outbreak in crowded areas. Here, we proposed a real-time visualization strategy called "Super Monitoring" using artificial intelligence and Internet technology, thereby performing a low-cost, convenient, timely, and unspecialized diagnosis for earlier-stage mpox. Specifically, such AI-mediated "super monitoring" (Mpox-AISM) invokes a framework assembled by deep learning models, data augmentation, self-supervised learning, and cloud services. Verified by publicly available datasets, the Precision, Recall, Specificity, and F1-score of Mpox-AISM in diagnosing mpox achieved 99.3%, 94.1%, 99.9%, and 96.6%, respectively. Furthermore, Mpox-AISM's overall accuracy reaches 94.51% in diagnosing mpox, six like-mpox skin diseases, and normal skin. We also employed gradient-weighted class activation mapping to explain the decision-making process of Mpox-AISM, thus handily understanding the specific characteristics that may indicate the mpox's onset and improving its reliability. With the help of the Internet and communication terminal, Mpox-AISM can perform a real-time, low-cost, and convenient diagnosis for earlier-stage mpox in various real-world settings, thereby effectively curbing the spread of mpox virus.
♻ ☆ Open-Vocabulary Camouflaged Object Segmentation
Recently, the emergence of the large-scale vision-language model (VLM), such as CLIP, has opened the way towards open-world object perception. Many works have explored the utilization of pre-trained VLM for the challenging open-vocabulary dense prediction task that requires perceiving diverse objects with novel classes at inference time. Existing methods construct experiments based on the public datasets of related tasks, which are not tailored for open vocabulary and rarely involve imperceptible objects camouflaged in complex scenes due to data collection bias and annotation costs. To fill in the gaps, we introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS), and construct a large-scale complex scene dataset (\textbf{OVCamo}) containing 11,483 hand-selected images with fine annotations and corresponding object classes. Further, we build a strong single-stage open-vocabulary \underline{c}amouflaged \underline{o}bject \underline{s}egmentation transform\underline{er} baseline \textbf{OVCoser} attached to the parameter-fixed CLIP with iterative semantic guidance and structure enhancement. By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects. Moreover, this effective framework also surpasses previous state-of-the-arts of open-vocabulary semantic image segmentation by a large margin on our OVCamo dataset. With the proposed dataset and baseline, we hope that this new task with more practical value can further expand the research on open-vocabulary dense prediction tasks. The code and data will be available in the future.
comment: Update the style and add details
♻ ☆ Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos ICCV 2023
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. Building on these results, we take one step further and explore the possibility of integrating these two features to enhance object-centric representations. Our preliminary experiments indicate that query slot attention can extract different semantic components from the RGB feature map, while random sampling based slot attention can exploit temporal correspondence cues between frames to assist instance identification. Motivated by this, we propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. It comprises two slot attention stages with a set of shared learnable Gaussian distributions. In the first stage, we use the mean vectors as slot initialization to decompose potential semantics and generate semantic segmentation masks through iterative attention. In the second stage, for each semantics, we randomly sample slots from the corresponding Gaussian distribution and perform masked feature aggregation within the semantic area to exploit temporal correspondence patterns for instance identification. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations. Our model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery. Furthermore, we achieve state-of-the-art performance on dense label propagation tasks, demonstrating the potential for object-centric analysis. The code is released at https://github.com/shvdiwnkozbw/SMTC.
comment: ICCV 2023
♻ ☆ ICP-Flow: LiDAR Scene Flow Estimation with ICP CVPR 2024
Scene flow characterizes the 3D motion between two LiDAR scans captured by an autonomous vehicle at nearby timesteps. Prevalent methods consider scene flow as point-wise unconstrained flow vectors that can be learned by either large-scale training beforehand or time-consuming optimization at inference. However, these methods do not take into account that objects in autonomous driving often move rigidly. We incorporate this rigid-motion assumption into our design, where the goal is to associate objects over scans and then estimate the locally rigid transformations. We propose ICP-Flow, a learning-free flow estimator. The core of our design is the conventional Iterative Closest Point (ICP) algorithm, which aligns the objects over time and outputs the corresponding rigid transformations. Crucially, to aid ICP, we propose a histogram-based initialization that discovers the most likely translation, thus providing a good starting point for ICP. The complete scene flow is then recovered from the rigid transformations. We outperform state-of-the-art baselines, including supervised models, on the Waymo dataset and perform competitively on Argoverse-v2 and nuScenes. Further, we train a feedforward neural network, supervised by the pseudo labels from our model, and achieve top performance among all models capable of real-time inference. We validate the advantage of our model on scene flow estimation with longer temporal gaps, up to 0.4 seconds where other models fail to deliver meaningful results.
comment: CVPR 2024, camera-ready. Code: https://github.com/yanconglin/ICP-Flow
♻ ☆ GSVA: Generalized Segmentation via Multimodal Large Language Models CVPR2024
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationships of the instances in the image and identifying non-existing referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient in understanding contexts with visual inputs. Among them, LISA, as a representative, adopts a special [SEG] token to prompt a segmentation mask decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt or provide descriptions incongruent with any image target. In this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. Specifically, GSVA reuses the [SEG] token to prompt the segmentation model towards supporting multiple mask references simultaneously and innovatively learns to generate a [REJ] token to reject the null targets explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue, marking a notable enhancement and setting a new record on the GRES benchmark gRefCOCO dataset. GSVA also proves effective across various classic referring segmentation and comprehension tasks.
comment: Accepted by CVPR2024 (19 pages, 9 figures, 11 tables)
♻ ☆ RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation ICLR
Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the prior SOTA method in both inference speed and separation quality while reducing the number of parameters by 90% and MACs by 83%. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
comment: Accepted by The Twelfth International Conference on Learning Representations (ICLR) 2024, see https://openreview.net/forum?id=PEuDO2EiDr
♻ ☆ Unveiling Parts Beyond Objects:Towards Finer-Granularity Referring Expression Segmentation CVPR 2024
Referring expression segmentation (RES) aims at segmenting the foreground masks of the entities that match the descriptive natural language expression. Previous datasets and methods for classic RES task heavily rely on the prior assumption that one expression must refer to object-level targets. In this paper, we take a step further to finer-grained part-level RES task. To promote the object-level RES task towards finer-grained vision-language understanding, we put forward a new multi-granularity referring expression segmentation (MRES) task and construct an evaluation benchmark called RefCOCOm by manual annotations. By employing our automatic model-assisted data engine, we build the largest visual grounding dataset namely MRES-32M, which comprises over 32.2M high-quality masks and captions on the provided 1M images. Besides, a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task. Extensive experiments on our RefCOCOm for MRES and three datasets (i.e., RefCOCO(+/g) for classic RES task demonstrate the superiority of our method over previous state-of-the-art methods. To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset and model UniRES will be publicly available at https://github.com/Rubics-Xuan/MRES
comment: This work is accepted by CVPR 2024
♻ ☆ FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs
Despite the remarkable performance of video-based large language models (LLMs), their adversarial threat remains unexplored. To fill this gap, we propose the first adversarial attack tailored for video-based LLMs by crafting flow-based multi-modal adversarial perturbations on a small fraction of frames within a video, dubbed FMM-Attack. Extensive experiments show that our attack can effectively induce video-based LLMs to generate incorrect answers when videos are added with imperceptible adversarial perturbations. Intriguingly, our FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate. Overall, our observations inspire a further understanding of multi-modal robustness and safety-related feature alignment across different modalities, which is of great importance for various large multi-modal models. Our code is available at https://github.com/THU-Kingmin/FMM-Attack.
♻ ☆ CBNet: A Plug-and-Play Network for Segmentation-Based Scene Text Detection
Recently, segmentation-based methods are quite popular in scene text detection, which mainly contain two steps: text kernel segmentation and expansion. However, the segmentation process only considers each pixel independently, and the expansion process is difficult to achieve a favorable accuracy-speed trade-off. In this paper, we propose a Context-aware and Boundary-guided Network (CBN) to tackle these problems. In CBN, a basic text detector is firstly used to predict initial segmentation results. Then, we propose a context-aware module to enhance text kernel feature representations, which considers both global and local contexts. Finally, we introduce a boundary-guided module to expand enhanced text kernels adaptively with only the pixels on the contours, which not only obtains accurate text boundaries but also keeps high speed, especially on high-resolution output maps. In particular, with a lightweight backbone, the basic detector equipped with our proposed CBN achieves state-of-the-art results on several popular benchmarks, and our proposed CBN can be plugged into several segmentation-based methods. Code is available at https://github.com/XiiZhao/cbn.pytorch.
comment: Accepted by IJCV 2024. Code is available at this https URL: https://github.com/XiiZhao/cbn.pytorch
♻ ☆ Conditional Tuning Network for Few-Shot Adaptation of Segmentation Anything Model
The recent Segment Anything Model (SAM) has demonstrated remarkable zero-shot capability and flexible geometric prompting in general image segmentation. However, SAM often struggles when handling various unconventional images, such as aerial, medical, and non-RGB images. This paper presents CAT-SAM, a ConditionAl Tuning network that adapts SAM toward various unconventional target tasks with just few-shot target samples. CAT-SAM freezes the entire SAM and adapts its mask decoder and image encoder simultaneously with a small number of learnable parameters. The core design is a prompt bridge structure that enables decoder-conditioned joint tuning of the heavyweight image encoder and the lightweight mask decoder. The bridging maps the prompt token of the mask decoder to the image encoder, fostering synergic adaptation of the encoder and the decoder with mutual benefits. We develop two representative tuning strategies for the image encoder which leads to two CAT-SAM variants: one injecting learnable prompt tokens in the input space and the other inserting lightweight adapter networks. Extensive experiments over 11 unconventional tasks show that both CAT-SAM variants achieve superior target segmentation performance consistently even under the very challenging one-shot adaptation setup. Project page: https://xiaoaoran.github.io/projects/CAT-SAM
comment: Project page: https://xiaoaoran.github.io/projects/CAT-SAM
♻ ☆ ShaDocFormer: A Shadow-Attentive Threshold Detector With Cascaded Fusion Refiner for Document Shadow Removal IJCNN 2024
Document shadow is a common issue that arises when capturing documents using mobile devices, which significantly impacts readability. Current methods encounter various challenges, including inaccurate detection of shadow masks and estimation of illumination. In this paper, we propose ShaDocFormer, a Transformer-based architecture that integrates traditional methodologies and deep learning techniques to tackle the problem of document shadow removal. The ShaDocFormer architecture comprises two components: the Shadow-attentive Threshold Detector (STD) and the Cascaded Fusion Refiner (CFR). The STD module employs a traditional thresholding technique and leverages the attention mechanism of the Transformer to gather global information, thereby enabling precise detection of shadow masks. The cascaded and aggregative structure of the CFR module facilitates a coarse-to-fine restoration process for the entire image. As a result, ShaDocFormer excels in accurately detecting and capturing variations in both shadow and illumination, thereby enabling effective removal of shadows. Extensive experiments demonstrate that ShaDocFormer outperforms current state-of-the-art methods in both qualitative and quantitative measurements.
comment: Accepted by IJCNN 2024
♻ ☆ MinD-3D: Reconstruct High-quality 3D objects in Human Brain
In this paper, we introduce Recon3DMind, an innovative task aimed at reconstructing 3D visuals from Functional Magnetic Resonance Imaging (fMRI) signals, marking a significant advancement in the fields of cognitive neuroscience and computer vision. To support this pioneering task, we present the fMRI-Shape dataset, which includes data from 14 participants and features 360-degree videos of 3D objects to enable comprehensive fMRI signal capture across various settings, thereby laying a foundation for future research. Furthermore, we propose MinD-3D, a novel and effective three-stage framework specifically designed to decode the brain's 3D visual information from fMRI signals, demonstrating the feasibility of this challenging task. The framework begins by extracting and aggregating features from fMRI frames through a neuro-fusion encoder, subsequently employs a feature bridge diffusion model to generate visual features, and ultimately recovers the 3D object via a generative transformer decoder. We assess the performance of MinD-3D using a suite of semantic and structural metrics and analyze the correlation between the features extracted by our model and the visual regions of interest (ROIs) in fMRI signals. Our findings indicate that MinD-3D not only reconstructs 3D objects with high semantic relevance and spatial similarity but also significantly enhances our understanding of the human brain's capabilities in processing 3D visual information. Project page at: https://jianxgao.github.io/MinD-3D.
comment: 26 pages, 13 figures
♻ ☆ Neural Markov Random Field for Stereo Matching CVPR 2024
Stereo matching is a core task for many computer vision and robotics applications. Despite their dominance in traditional stereo methods, the hand-crafted Markov Random Field (MRF) models lack sufficient modeling accuracy compared to end-to-end deep models. While deep learning representations have greatly improved the unary terms of the MRF models, the overall accuracy is still severely limited by the hand-crafted pairwise terms and message passing. To address these issues, we propose a neural MRF model, where both potential functions and message passing are designed using data-driven neural networks. Our fully data-driven model is built on the foundation of variational inference theory, to prevent convergence issues and retain stereo MRF's graph inductive bias. To make the inference tractable and scale well to high-resolution images, we also propose a Disparity Proposal Network (DPN) to adaptively prune the search space of disparity. The proposed approach ranks $1^{st}$ on both KITTI 2012 and 2015 leaderboards among all published methods while running faster than 100 ms. This approach significantly outperforms prior global methods, e.g., lowering D1 metric by more than 50% on KITTI 2015. In addition, our method exhibits strong cross-domain generalization and can recover sharp edges. The codes at https://github.com/aeolusguan/NMRF
comment: Accepted to CVPR 2024
♻ ☆ Active Prompt Learning in Vision Language Models CVPR 2024
Pre-trained Vision Language Models (VLMs) have demonstrated notable progress in various zero-shot tasks, such as classification and retrieval. Despite their performance, because improving performance on new tasks requires task-specific knowledge, their adaptation is essential. While labels are needed for the adaptation, acquiring them is typically expensive. To overcome this challenge, active learning, a method of achieving a high performance by obtaining labels for a small number of samples from experts, has been studied. Active learning primarily focuses on selecting unlabeled samples for labeling and leveraging them to train models. In this study, we pose the question, "how can the pre-trained VLMs be adapted under the active learning framework?" In response to this inquiry, we observe that (1) simply applying a conventional active learning framework to pre-trained VLMs even may degrade performance compared to random selection because of the class imbalance in labeling candidates, and (2) the knowledge of VLMs can provide hints for achieving the balance before labeling. Based on these observations, we devise a novel active learning framework for VLMs, denoted as PCB. To assess the effectiveness of our approach, we conduct experiments on seven different real-world datasets, and the results demonstrate that PCB surpasses conventional active learning and random sampling methods. Code will be available in https://github.com/kaist-dmlab/pcb .
comment: accepted at CVPR 2024
♻ ☆ LMM-Assisted Breast Cancer Treatment Target Segmentation with Consistency Embedding
Recent advancements in Artificial Intelligence (AI) have profoundly influenced medical fields, by providing tools to reduce clinical workloads. However, most AI models are constrained to execute unimodal tasks, in stark contrast to the comprehensive approaches utilized by medical professionals. To address this, here we present RO-LMM, a multi-purpose large multimodal model (LMM) tailored for the field of radiation oncology. This model covers series of tasks within clinical workflow, adept at clinical report summarization, radiation treatment plan suggestion, and plan-guided target volume segmentation. In particular, to perform consecutive clinical tasks, we further present a novel Consistency Embedding Fine-Tuning (CEFTune) technique, which boosts LMM's robustness to noisy inputs while preserving the capability of handling clean inputs, and transform this concept into LMM-driven segmentation framework as Consistency Embedding Segmentation~(CESEG). Experimental results on multi-centre cohorts demonstrate our RO-LMM's promising performance for multiple clinical tasks with generalization capabilities.
comment: 30 pages, 16 table, 5 figures
♻ ☆ NocPlace: Nocturnal Visual Place Recognition via Generative and Inherited Knowledge Transfer
Visual Place Recognition (VPR) is crucial in computer vision, aiming to retrieve database images similar to a query image from an extensive collection of known images. However, like many vision tasks, VPR always degrades at night due to the scarcity of nighttime images. Moreover, VPR needs to address the cross-domain problem of night-to-day rather than just the issue of a single nighttime domain. In response to these issues, we present NocPlace, which leverages generative and inherited knowledge transfer to embed resilience against dazzling lights and extreme darkness in the global descriptor. First, we establish a day-night urban scene dataset called NightCities, capturing diverse lighting variations and dark scenarios across 60 cities globally. Then, an image generation network is trained on this dataset and processes a large-scale VPR dataset, obtaining its nighttime version. Finally, VPR models are fine-tuned using descriptors inherited from themselves and night-style images, which builds explicit cross-domain contrastive relationships. Comprehensive experiments on various datasets demonstrate our contributions and the superiority of NocPlace. Without adding any real-time computing resources, NocPlace improves the performance of Eigenplaces by 7.6% on Tokyo 24/7 Night and 16.8% on SVOX Night.
comment: 28 pages,9 figures
♻ ☆ Dodging DeepFake Detection via Implicit Spatial-Domain Notch Filtering
The current high-fidelity generation and high-precision detection of DeepFake images are at an arms race. We believe that producing DeepFakes that are highly realistic and 'detection evasive' can serve the ultimate goal of improving future generation DeepFake detection capabilities. In this paper, we propose a simple yet powerful pipeline to reduce the artifact patterns of fake images without hurting image quality by performing implicit spatial-domain notch filtering. We first demonstrate that frequency-domain notch filtering, although famously shown to be effective in removing periodic noise in the spatial domain, is infeasible for our task at hand due to the manual designs required for the notch filters. We, therefore, resort to a learning-based approach to reproduce the notch filtering effects, but solely in the spatial domain. We adopt a combination of adding overwhelming spatial noise for breaking the periodic noise pattern and deep image filtering to reconstruct the noise-free fake images, and we name our method DeepNotch. Deep image filtering provides a specialized filter for each pixel in the noisy image, producing filtered images with high fidelity compared to their DeepFake counterparts. Moreover, we also use the semantic information of the image to generate an adversarial guidance map to add noise intelligently. Our large-scale evaluation on 3 representative state-of-the-art DeepFake detection methods (tested on 16 types of DeepFakes) has demonstrated that our technique significantly reduces the accuracy of these 3 fake image detection methods, 36.79% on average and up to 97.02% in the best case.
comment: 14 pages
Information Retrieval
☆ Knowledge-Enhanced Recommendation with User-Centric Subgraph Network
Recommendation systems, as widely implemented nowadays on various platforms, recommend relevant items to users based on their preferences. The classical methods which rely on user-item interaction matrices has limitations, especially in scenarios where there is a lack of interaction data for new items. Knowledge graph (KG)-based recommendation systems have emerged as a promising solution. However, most KG-based methods adopt node embeddings, which do not provide personalized recommendations for different users and cannot generalize well to the new items. To address these limitations, we propose Knowledge-enhanced User-Centric subgraph Network (KUCNet), a subgraph learning approach with graph neural network (GNN) for effective recommendation. KUCNet constructs a U-I subgraph for each user-item pair that captures both the historical information of user-item interactions and the side information provided in KG. An attention-based GNN is designed to encode the U-I subgraphs for recommendation. Considering efficiency, the pruned user-centric computation graph is further introduced such that multiple U-I subgraphs can be simultaneously computed and that the size can be pruned by Personalized PageRank. Our proposed method achieves accurate, efficient, and interpretable recommendations especially for new items. Experimental results demonstrate the superiority of KUCNet over state-of-the-art KG-based and collaborative filtering (CF)-based methods.
☆ FIT-RAG: Black-Box RAG with Factual Information and Token Reduction
Due to the extraordinarily large number of parameters, fine-tuning Large Language Models (LLMs) to update long-tail or out-of-date knowledge is impractical in lots of applications. To avoid fine-tuning, we can alternatively treat a LLM as a black-box (i.e., freeze the parameters of the LLM) and augment it with a Retrieval-Augmented Generation (RAG) system, namely black-box RAG. Recently, black-box RAG has achieved success in knowledge-intensive tasks and has gained much attention. Existing black-box RAG methods typically fine-tune the retriever to cater to LLMs' preferences and concatenate all the retrieved documents as the input, which suffers from two issues: (1) Ignorance of Factual Information. The LLM preferred documents may not contain the factual information for the given question, which can mislead the retriever and hurt the effectiveness of black-box RAG; (2) Waste of Tokens. Simply concatenating all the retrieved documents brings large amounts of unnecessary tokens for LLMs, which degenerates the efficiency of black-box RAG. To address these issues, this paper proposes a novel black-box RAG framework which utilizes the factual information in the retrieval and reduces the number of tokens for augmentation, dubbed FIT-RAG. FIT-RAG utilizes the factual information by constructing a bi-label document scorer. Besides, it reduces the tokens by introducing a self-knowledge recognizer and a sub-document-level token reducer. FIT-RAG achieves both superior effectiveness and efficiency, which is validated by extensive experiments across three open-domain question-answering datasets: TriviaQA, NQ and PopQA. FIT-RAG can improve the answering accuracy of Llama2-13B-Chat by 14.3\% on TriviaQA, 19.9\% on NQ and 27.5\% on PopQA, respectively. Furthermore, it can save approximately half of the tokens on average across the three datasets.
☆ Understanding the Ranking Loss for Recommendation with Sparse User Feedback
Click-through rate (CTR) prediction holds significant importance in the realm of online advertising. While many existing approaches treat it as a binary classification problem and utilize binary cross entropy (BCE) as the optimization objective, recent advancements have indicated that combining BCE loss with ranking loss yields substantial performance improvements. However, the full efficacy of this combination loss remains incompletely understood. In this paper, we uncover a new challenge associated with BCE loss in scenarios with sparse positive feedback, such as CTR prediction: the gradient vanishing for negative samples. Subsequently, we introduce a novel perspective on the effectiveness of ranking loss in CTR prediction, highlighting its ability to generate larger gradients on negative samples, thereby mitigating their optimization issues and resulting in improved classification ability. Our perspective is supported by extensive theoretical analysis and empirical evaluation conducted on publicly available datasets. Furthermore, we successfully deployed the ranking loss in Tencent's online advertising system, achieving notable lifts of 0.70% and 1.26% in Gross Merchandise Value (GMV) for two main scenarios. The code for our approach is openly accessible at the following GitHub repository: https://github.com/SkylerLinn/Understanding-the-Ranking-Loss.
☆ M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval LREC
In recent research, contrastive learning has proven to be a highly effective method for representation learning and is widely used for dense retrieval. However, we identify that relying solely on contrastive learning can lead to suboptimal retrieval performance. On the other hand, despite many retrieval datasets supporting various learning objectives beyond contrastive learning, combining them efficiently in multi-task learning scenarios can be challenging. In this paper, we introduce M3, an advanced recursive Multi-hop dense sentence retrieval system built upon a novel Multi-task Mixed-objective approach for dense text representation learning, addressing the aforementioned challenges. Our approach yields state-of-the-art performance on a large-scale open-domain fact verification benchmark dataset, FEVER. Code and data are available at: https://github.com/TonyBY/M3
comment: Accepted by LREC-COLING 2024
☆ gTBLS: Generating Tables from Text by Conditional Question Answering
Distilling large, unstructured text into a structured, condensed form such as tables is an open research problem. One of the primary challenges in automatically generating tables is ensuring their syntactic validity. Prior approaches address this challenge by including additional parameters in the Transformer's attention mechanism to attend to specific rows and column headers. In contrast to this single-stage method, this paper presents a two-stage approach called Generative Tables (gTBLS). The first stage infers table structure (row and column headers) from the text. The second stage formulates questions using these headers and fine-tunes a causal language model to answer them. Furthermore, the gTBLS approach is amenable to the utilization of pre-trained Large Language Models in a zero-shot configuration, presenting a solution for table generation in situations where fine-tuning is not feasible. gTBLS improves prior approaches by up to 10% in BERTScore on the table construction task and up to 20% on the table content generation task of the E2E, WikiTableText, WikiBio, and RotoWire datasets.
comment: 12 pages, 1 figure
☆ Evaluating the Performance of LLMs on Technical Language Processing tasks
In this paper we present the results of an evaluation study of the perfor-mance of LLMs on Technical Language Processing tasks. Humans are often confronted with tasks in which they have to gather information from dispar-ate sources and require making sense of large bodies of text. These tasks can be significantly complex for humans and often require deep study including rereading portions of a text. Towards simplifying the task of gathering in-formation we evaluated LLMs with chat interfaces for their ability to provide answers to standard questions that a human can be expected to answer based on their reading of a body of text. The body of text under study is Title 47 of the United States Code of Federal Regulations (CFR) which describes regula-tions for commercial telecommunications as governed by the Federal Com-munications Commission (FCC). This has been a body of text of interest be-cause our larger research concerns the issue of making sense of information related to Wireless Spectrum Governance and usage in an automated manner to support Dynamic Spectrum Access. The information concerning this wireless spectrum domain is found in many disparate sources, with Title 47 of the CFR being just one of many. Using a range of LLMs and providing the required CFR text as context we were able to quantify the performance of those LLMs on the specific task of answering the questions below.
☆ Enhancing Medical Support in the Arabic Language Through Personalized ChatGPT Assistance
This Paper discusses the growing popularity of online medical diagnosis as an alternative to traditional doctor visits. It highlights the limitations of existing tools and emphasizes the advantages of using ChatGPT, which provides real-time, personalized medical diagnosis at no cost. The paragraph summarizes a research study that evaluated the performance of ChatGPT in Arabic medical diagnosis. The study involved compiling a dataset of disease information and generating multiple messages for each disease using different prompting techniques. ChatGPT's performance was assessed by measuring the similarity between its responses and the actual diseases. The results showed promising performance, with average scores of around 76% for similarity measures. Various prompting techniques were used, and chain prompting demonstrated a relative advantage. The study also recorded an average response time of 6.12 seconds for the ChatGPT API, which is considered acceptable but has room for improvement. While ChatGPT cannot replace human doctors entirely, the findings suggest its potential in emergency cases and addressing general medical inquiries. Overall, the study highlights ChatGPT's viability as a valuable tool in the medical field.
comment: This paper was presented at The International conference for Arabic language and applied linguistics
♻ ☆ EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models
In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard open-source instruction processing implementation framework available for the community, which hinders practitioners from further developing and advancing. To facilitate instruction processing research and development, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. EasyInstruct is publicly released and actively maintained at https://github.com/zjunlp/EasyInstruct, along with an online demo app and a demo video for quick-start, calling for broader research centered on instruction data and synthetic data.
comment: Project website: https://zjunlp.github.io/project/EasyInstruct Code: https://github.com/zjunlp/EasyInstruct Video: https://youtu.be/rfQOWYfziFo Demo: https://huggingface.co/spaces/zjunlp/EasyInstruct
♻ ☆ Discrete Semantic Tokenization for Deep CTR Prediction
Incorporating item content information into click-through rate (CTR) prediction models remains a challenge, especially with the time and space constraints of industrial scenarios. The content-encoding paradigm, which integrates user and item encoders directly into CTR models, prioritizes space over time. In contrast, the embedding-based paradigm transforms item and user semantics into latent embeddings, subsequently caching them to optimize processing time at the expense of space. In this paper, we introduce a new semantic-token paradigm and propose a discrete semantic tokenization approach, namely UIST, for user and item representation. UIST facilitates swift training and inference while maintaining a conservative memory footprint. Specifically, UIST quantizes dense embedding vectors into discrete tokens with shorter lengths and employs a hierarchical mixture inference module to weigh the contribution of each user--item token pair. Our experimental results on news recommendation showcase the effectiveness and efficiency (about 200-fold space compression) of UIST for CTR prediction.
comment: TheWebConf 2024 accepted paper
♻ ☆ Pushing the Limits: Concurrency Detection in Acyclic Sound Free-Choice Workflow Nets in $O(P^2 + T^2)$
Concurrency is an important aspect of Petri nets to describe and simulate the behavior of complex systems. Knowing which places and transitions could be executed in parallel helps to understand nets and enables analysis techniques and the computation of other properties, such as causality, exclusivity, etc.. All techniques based on concurrency detection depend on the efficiency of this detection methodology. Kovalyov and Esparza have developed algorithms that compute all concurrent places in $O\big((P+T)TP^2\big)$ for live and bounded nets (where $P$ and $T$ are the numbers of places and transitions) and in $O\big(P(P+T)^2\big)$ for live and bounded free-choice nets. Although these algorithms have a reasonably good computational complexity, large numbers of concurrent pairs of nodes may still lead to long computation times. This paper complements the palette of concurrency detection algorithms with the Concurrent Paths (CP) algorithm for sound free-choice workflow nets. The algorithm allows parallelization and has a worst-case computational complexity of $O(P^2 + T^2)$ for acyclic nets and of $O(P^3 + PT^2)$ for cyclic nets. Although the computational complexity of cyclic nets has not improved, the evaluation shows the benefits of CP, especially, if the net contains many nodes in concurrency relation.
comment: 19 pages, 14 figures, 5 algorithms
♻ ☆ TensorBank: Tensor Lakehouse for Foundation Model Training
Storing and streaming high dimensional data for foundation model training became a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows to directly address tensors on block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory translating relational queries and requested transformations as an instance. By making use of the HSI, irrelevant blocks can be skipped without reading them as those indices contain statistics on their content at different hierarchical resolution levels. This is an opinionated architecture powered by open standards and making heavy use of open-source technology. Although, hardened for production use using geospatial-temporal data, this architecture generalizes to other use case like computer vision, computational neuroscience, biological sequence analysis and more.
Mathematical Software
☆ Extrapolating Solution Paths of Polynomial Homotopies towards Singularities with PHCpack and phcpy
A robust path tracker [Telen, Van Barel, Verschelde, SISC 2020] computes the radius of convergence of Newton's method, estimates the distance to the nearest path, and then applies Pad\'e approximants to predict the next point on the path. Apriori step size control is less sensitive to finely tuned tolerances than aposteriori step size control, and is therefore robust. Extrapolation methods are effective to accurately locate the singular points at the end of solution paths, as illustrated with phcpy, the scripting interface to PHCpack.
♻ ☆ CUQIpy: II. Computational uncertainty quantification for PDE-based inverse problems in Python
Inverse problems, particularly those governed by Partial Differential Equations (PDEs), are prevalent in various scientific and engineering applications, and uncertainty quantification (UQ) of solutions to these problems is essential for informed decision-making. This second part of a two-paper series builds upon the foundation set by the first part, which introduced CUQIpy, a Python software package for computational UQ in inverse problems using a Bayesian framework. In this paper, we extend CUQIpy's capabilities to solve PDE-based Bayesian inverse problems through a general framework that allows the integration of PDEs in CUQIpy, whether expressed natively or using third-party libraries such as FEniCS. CUQIpy offers concise syntax that closely matches mathematical expressions, streamlining the modeling process and enhancing the user experience. The versatility and applicability of CUQIpy to PDE-based Bayesian inverse problems are demonstrated on examples covering parabolic, elliptic and hyperbolic PDEs. This includes problems involving the heat and Poisson equations and application case studies in electrical impedance tomography and photo-acoustic tomography, showcasing the software's efficiency, consistency, and intuitive interface. This comprehensive approach to UQ in PDE-based inverse problems provides accessibility for non-experts and advanced features for experts.
comment: 38 pages, 12 figures
♻ ☆ CUQIpy: I. Computational uncertainty quantification for inverse problems in Python
This paper introduces CUQIpy, a versatile open-source Python package for computational uncertainty quantification (UQ) in inverse problems, presented as Part I of a two-part series. CUQIpy employs a Bayesian framework, integrating prior knowledge with observed data to produce posterior probability distributions that characterize the uncertainty in computed solutions to inverse problems. The package offers a high-level modeling framework with concise syntax, allowing users to easily specify their inverse problems, prior information, and statistical assumptions. CUQIpy supports a range of efficient sampling strategies and is designed to handle large-scale problems. Notably, the automatic sampler selection feature analyzes the problem structure and chooses a suitable sampler without user intervention, streamlining the process. With a selection of probability distributions, test problems, computational methods, and visualization tools, CUQIpy serves as a powerful, flexible, and adaptable tool for UQ in a wide selection of inverse problems. Part II of the series focuses on the use of CUQIpy for UQ in inverse problems with partial differential equations (PDEs).
comment: 37 pages, 11 figures
♻ ☆ A new open source framework for multiscale modeling of fibrous materials on heterogeneous supercomputers
This article presents MuMFiM, an open source application for multiscale modeling of fibrous materials on massively parallel computers. MuMFiM uses two scales to represent fibrous materials such as biological network materials (extracellular matrix, connective tissue, etc.). It is designed to make use of multiple levels of parallelism, including distributed parallelism of the macro and microscales as well as GPU accelerated data-parallelism of the microscale. Scaling results of the GPU accelerated microscale show that solving microscale problems concurrently on the GPU can lead to a 1000x speedup over the solution of a single RVE on the GPU. In addition, we show nearly optimal strong and weak scaling results of MuMFiM on up to 128 nodes of AiMOS (Rensselaer Polytechnic Institute) which is composed of IBM AC922 nodes with 6 Volta V100 GPU and 2 20 core Power 9 CPUs each. We also show how MuMFiM can be used to solve problems of interest to the broader engineering community, in particular providing an example of the facet capsule ligament (FCL) of the human spine undergoing uniaxial extension.
Data Structures and Algorithms
☆ Induced Subforests and Superforests
Graph isomorphism, subgraph isomorphism, and maximum common subgraphs are classical well-investigated objects. Their (parameterized) complexity and efficiently tractable cases have been studied. In the present paper, for a given set of forests, we study maximum common induced subforests and minimum common induced superforests. We show that finding a maximum subforest is NP-hard already for two subdivided stars while finding a minimum superforest is tractable for two trees but NP-hard for three trees. For a given set of $k$ trees, we present an efficient greedy $\left(\frac{k}{2}-\frac{1}{2}+\frac{1}{k}\right)$-approximation algorithm for the minimum superforest problem. Finally, we present a polynomial time approximation scheme for the maximum subforest problem for any given set of forests.
☆ A Differentially Private Clustering Algorithm for Well-Clustered Graphs
We study differentially private (DP) algorithms for recovering clusters in well-clustered graphs, which are graphs whose vertex set can be partitioned into a small number of sets, each inducing a subgraph of high inner conductance and small outer conductance. Such graphs have widespread application as a benchmark in the theoretical analysis of spectral clustering. We provide an efficient ($\epsilon$,$\delta$)-DP algorithm tailored specifically for such graphs. Our algorithm draws inspiration from the recent work of Chen et al., who developed DP algorithms for recovery of stochastic block models in cases where the graph comprises exactly two nearly-balanced clusters. Our algorithm works for well-clustered graphs with $k$ nearly-balanced clusters, and the misclassification ratio almost matches the one of the best-known non-private algorithms. We conduct experimental evaluations on datasets with known ground truth clusters to substantiate the prowess of our algorithm. We also show that any (pure) $\epsilon$-DP algorithm would result in substantial error.
☆ Space-Efficient Indexes for Uncertain Strings ICDE 2024
Strings in the real world are often encoded with some level of uncertainty. In the character-level uncertainty model, an uncertain string $X$ of length $n$ on an alphabet $\Sigma$ is a sequence of $n$ probability distributions over $\Sigma$. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ is at least $\frac{1}{z}$. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has $\mathcal{O}(nz)$ size, requires $\mathcal{O}(nz)$ time and $\mathcal{O}(nz)$ space to be constructed, and answers pattern matching queries in the optimal $\mathcal{O}(m+|\text{Occ}|)$ time, where $m$ is the length of $P$ and $|\text{Occ}|$ is the total number of occurrences of $P$ in $X$. For large $n$ and (moderate) $z$ values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We propose an index of $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected size, which can be constructed using $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length $m\geq \ell$. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.
comment: Accepted to ICDE 2024. Abstract abridged to satisfy arXiv requirements
☆ Improved Algorithms for Maximum Coverage in Dynamic and Random Order Streams
The maximum coverage problem is to select $k$ sets from a collection of sets such that the cardinality of the union of the selected sets is maximized. We consider $(1-1/e-\epsilon)$-approximation algorithms for this NP-hard problem in three standard data stream models. 1. {\em Dynamic Model.} The stream consists of a sequence of sets being inserted and deleted. Our multi-pass algorithm uses $\epsilon^{-2} k \cdot \text{polylog}(n,m)$ space. The best previous result (Assadi and Khanna, SODA 2018) used $(n +\epsilon^{-4} k) \text{polylog}(n,m)$ space. While both algorithms use $O(\epsilon^{-1} \log n)$ passes, our analysis shows that when $\epsilon$ is a constant, it is possible to reduce the number of passes by a $1/\log \log n$ factor without incurring additional space. 2. {\em Random Order Model.} In this model, there are no deletions and the sets forming the instance are uniformly randomly permuted to form the input stream. We show that a single pass and $k \text{polylog}(n,m)$ space suffices for arbitrary small constant $\epsilon$. The best previous result, by Warneke et al.~(ESA 2023), used $k^2 \text{polylog}(n,m)$ space. 3. {\em Insert-Only Model.} Lastly, our results, along with numerous previous results, use a sub-sampling technique introduced by McGregor and Vu (ICDT 2017) to sparsify the input instance. We explain how this technique and others used in the paper can be implemented such that the amortized update time of our algorithm is polylogarithmic. This also implies an improvement of the state-of-the-art insert only algorithms in terms of the update time: $\text{polylog}(m,n)$ update time suffices whereas the best previous result by Jaud et al.~(SEA 2023) required update time that was linear in $k$.
♻ ☆ $(1-ε)$-Approximation of Knapsack in Nearly Quadratic Time
Knapsack is one of the most fundamental problems in theoretical computer science. In the $(1 - \epsilon)$-approximation setting, although there is a fine-grained lower bound of $(n + 1 / \epsilon) ^ {2 - o(1)}$ based on the $(\min, +)$-convolution hypothesis ([K{\"u}nnemann, Paturi and Stefan Schneider, ICALP 2017] and [Cygan, Mucha, Wegrzycki and Wlodarczyk, 2017]), the best algorithm is randomized and runs in $\tilde O\left(n + (\frac{1}{\epsilon})^{11/5}/2^{\Omega(\sqrt{\log(1/\epsilon)})}\right)$ time [Deng, Jin and Mao, SODA 2023], and it remains an important open problem whether an algorithm with a running time that matches the lower bound (up to a sub-polynomial factor) exists. We answer the question positively by showing a deterministic $(1 - \epsilon)$-approximation scheme for knapsack that runs in $\tilde O(n + (1 / \epsilon) ^ {2})$ time. We first extend a known lemma in a recursive way to reduce the problem to $n \epsilon$-additive approximation for $n$ items with profits in $[1, 2)$. Then we give a simple efficient geometry-based algorithm for the reduced problem.
♻ ☆ On Rotation Distance of Rank Bounded Trees
Computing the rotation distance between two binary trees with $n$ internal nodes efficiently (in $poly(n)$ time) is a long standing open question in the study of height balancing in tree data structures. In this paper, we initiate the study of this problem bounding the rank of the trees given at the input (defined by Ehrenfeucht and Haussler (1989) in the context of decision trees). We define the rank-bounded rotation distance between two given binary trees $T_1$ and $T_2$ (with $n$ internal nodes) of rank at most $r$, denoted by $d_r(T_1,T_2)$, as the length of the shortest sequence of rotations that transforms $T_1$ to $T_2$ with the restriction that the intermediate trees must be of rank at most $r$. We show that the rotation distance problem reduces in polynomial time to the rank bounded rotation distance problem. This motivates the study of the problem in the combinatorial and algorithmic frontiers. Observing that trees with rank $1$ coincide exactly with skew trees (binary trees where every internal node has at least one leaf as a child), we show the following results in this frontier : We present an $O(n^2)$ time algorithm for computing $d_1(T_1,T_2)$. That is, when the given trees are skew trees (we call this variant as skew rotation distance problem) - where the intermediate trees are restricted to be skew as well. In particular, our techniques imply that for any two skew trees $d(T_1,T_2) \le n^2$. We show the following upper bound : for any two trees $T_1$ and $T_2$ of rank at most $r_1$ and $r_2$ respectively, we have that: $d_r(T_1,T_2) \le n^2 (1+(2n+1)(r_1+r_2-2))$ where $r = max\{r_1,r_2\}$. This bound is asymptotically tight for $r=1$. En route our proof of the above theorems, we associate binary trees to permutations and bivariate polynomials, and prove several characterizations in the case of skew trees.
comment: 28 pages, 2 figures, Abstract shortened to meet arxiv requirements, accepted journal version
♻ ☆ Pushing the Limits: Concurrency Detection in Acyclic Sound Free-Choice Workflow Nets in $O(P^2 + T^2)$
Concurrency is an important aspect of Petri nets to describe and simulate the behavior of complex systems. Knowing which places and transitions could be executed in parallel helps to understand nets and enables analysis techniques and the computation of other properties, such as causality, exclusivity, etc.. All techniques based on concurrency detection depend on the efficiency of this detection methodology. Kovalyov and Esparza have developed algorithms that compute all concurrent places in $O\big((P+T)TP^2\big)$ for live and bounded nets (where $P$ and $T$ are the numbers of places and transitions) and in $O\big(P(P+T)^2\big)$ for live and bounded free-choice nets. Although these algorithms have a reasonably good computational complexity, large numbers of concurrent pairs of nodes may still lead to long computation times. This paper complements the palette of concurrency detection algorithms with the Concurrent Paths (CP) algorithm for sound free-choice workflow nets. The algorithm allows parallelization and has a worst-case computational complexity of $O(P^2 + T^2)$ for acyclic nets and of $O(P^3 + PT^2)$ for cyclic nets. Although the computational complexity of cyclic nets has not improved, the evaluation shows the benefits of CP, especially, if the net contains many nodes in concurrency relation.
comment: 19 pages, 14 figures, 5 algorithms
♻ ☆ Subexponential parameterized algorithms in square graphs and intersection graphs of thin objects
In this paper we investigate the existence of subexponential parameterized algorithms of three fundamental cycle-hitting problems in geometric graph classes. The considered problems, \textsc{Triangle Hitting} (TH), \textsc{Feedback Vertex Set} (FVS), and \textsc{Odd Cycle Transversal} (OCT) ask for the existence in a graph $G$ of a set $X$ of at most $k$ vertices such that $G-X$ is, respectively, triangle-free, acyclic, or bipartite. Such subexponential parameterized algorithms are known to exist in planar and even $H$-minor free graphs from bidimensionality theory [Demaine et al., JACM 2005], and there is a recent line of work lifting these results to geometric graph classes consisting of intersection of "fat" objects ([Grigoriev et al., FOCS 2022] and [Lokshtanov et al., SODA 2022]). In this paper we focus on "thin" objects by considering intersection graphs of segments in the plane with $d$ possible slopes ($d$-DIR graphs) and contact graphs of segments in the plane. Assuming the ETH, we rule out the existence of algorithms: - solving TH in time $2^{o(n)}$ in 2-DIR graphs; and - solving TH, FVS, and OCT in time $2^{o(\sqrt{n})}$ in $K_{2,2}$-free contact 2-DIR graphs. These results indicate that additional restrictions are necessary in order to obtain subexponential parameterized algorithms for %these problems. In this direction we provide: - a $2^{O(k^{3/4}\cdot \log k)}n^{O(1)}$-time algorithm for FVS in contact segment graphs; - a $2^{O(\sqrt d\cdot t^2 \log t\cdot k^{2/3}\log k)} n^{O(1)}$-time algorithm for TH in $K_{t,t}$-free $d$-DIR graphs; and - a $2^{O(k^{7/9}\log^{3/2}k)} n^{O(1)}$-time algorithm for TH in contact segment graphs.
♻ ☆ The Role of Transparency in Repeated First-Price Auctions with Unknown Valuations STOC 2024
We study the problem of regret minimization for a single bidder in a sequence of first-price auctions where the bidder discovers the item's value only if the auction is won. Our main contribution is a complete characterization, up to logarithmic factors, of the minimax regret in terms of the auction's \emph{transparency}, which controls the amount of information on competing bids disclosed by the auctioneer at the end of each auction. Our results hold under different assumptions (stochastic, adversarial, and their smoothed variants) on the environment generating the bidder's valuations and competing bids. These minimax rates reveal how the interplay between transparency and the nature of the environment affects how fast one can learn to bid optimally in first-price auctions.
comment: Accepted at STOC 2024
♻ ☆ The Complexity of Temporal Vertex Cover in Small-Degree Graphs
Temporal graphs naturally model graphs whose underlying topology changes over time. Recently, the problems TEMPORAL VERTEX COVER (or TVC) and SLIDING-WINDOW TEMPORAL VERTEX COVER(or $\Delta$-TVC for time-windows of a fixed-length $\Delta$) have been established as natural extensions of the classic problem VERTEX COVER on static graphs with connections to areas such as surveillance in sensor networks. In this paper we initiate a systematic study of the complexity of TVC and $\Delta$-TVC on sparse graphs. Our main result shows that for every $\Delta\geq 2$, $\Delta$-TVC is NP-hard even when the underlying topology is described by a path or a cycle. This resolves an open problem from literature and shows a surprising contrast between $\Delta$-TVC and TVC for which we provide a polynomial-time algorithm in the same setting. To circumvent this hardness, we present a number of exact and approximation algorithms for temporal graphs whose underlying topologies are given by a path, that have bounded vertex degree in every time step, or that admit a small-sized temporal vertex cover.
comment: Changes to section 4.2.2
♻ ☆ Efficient Approximation of Quantum Channel Fidelity Exploiting Symmetry
Determining the optimal fidelity for the transmission of quantum information over noisy quantum channels is one of the central problems in quantum information theory. Recently, [Berta-Borderi-Fawzi-Scholz, Mathematical Programming, 2021] introduced an asymptotically converging semidefinite programming hierarchy of outer bounds for this quantity. However, the size of the semidefinite programs (SDPs) grows exponentially with respect to the level of the hierarchy, thus making their computation unscalable. In this work, by exploiting the symmetries in the SDP, we show that, for a fixed output dimension of the quantum channel, we can compute the SDP in time polynomial with respect to the level of the hierarchy and input dimension. As a direct consequence of our result, the optimal fidelity can be approximated with an accuracy of $\epsilon$ in $\mathrm{poly}(1/\epsilon, \text{input dimension})$ time.
♻ ☆ On the Congruency-Constrained Matroid Base
Consider a matroid where all elements are labeled with an element in $\mathbb{Z}$. We are interested in finding a base where the sum of the labels is congruent to $g \pmod m$. We show that this problem can be solved in $\tilde{O}(2^{4m} n r^{5/6})$ time for a matroid with $n$ elements and rank $r$, when $m$ is either the product of two primes or a prime power. The algorithm can be generalized to all moduli and, in fact, to all abelian groups if a classic additive combinatorics conjecture by Schrijver and Seymour holds true. We also discuss the optimization version of the problem.
♻ ☆ Towards Optimal Output-Sensitive Clique Listing or: Listing Cliques from Smaller Cliques
We study finding and listing $k$-cliques in a graph, for constant $k\geq 3$, a fundamental problem of both theoretical and practical importance. Our main contribution is a new output-sensitive algorithm for listing $k$-cliques in graphs, for arbitrary $k\geq 3$, coupled with lower bounds based on standard fine-grained assumptions, showing that our algorithm's running time is tight. Previously, the only known conditionally optimal output-sensitive algorithms were for the case of $3$-cliques by Bj\"{o}rklund, Pagh, Vassilevska W. and Zwick [ICALP'14]. Typical inputs to subgraph isomorphism or listing problems are measured by the number of nodes $n$ or the number of edges $m$. Our framework is very general in that it gives $k$-clique listing algorithms whose running times are measured in terms of the number of $\ell$-cliques $\Delta_\ell$ in the graph for any $1\leq \ell
comment: 48 pages, 5 figures
♻ ☆ Randomly punctured Reed--Solomon codes achieve list-decoding capacity over linear-sized fields
Reed--Solomon codes are a classic family of error-correcting codes consisting of evaluations of low-degree polynomials over a finite field on some sequence of distinct field elements. They are widely known for their optimal unique-decoding capabilities, but their list-decoding capabilities are not fully understood. Given the prevalence of Reed-Solomon codes, a fundamental question in coding theory is determining if Reed--Solomon codes can optimally achieve list-decoding capacity. A recent breakthrough by Brakensiek, Gopi, and Makam, established that Reed--Solomon codes are combinatorially list-decodable all the way to capacity. However, their results hold for randomly-punctured Reed--Solomon codes over an exponentially large field size $2^{O(n)}$, where $n$ is the block length of the code. A natural question is whether Reed--Solomon codes can still achieve capacity over smaller fields. Recently, Guo and Zhang showed that Reed--Solomon codes are list-decodable to capacity with field size $O(n^2)$. We show that Reed--Solomon codes are list-decodable to capacity with linear field size $O(n)$, which is optimal up to the constant factor. We also give evidence that the ratio between the alphabet size $q$ and code length $n$ cannot be bounded by an absolute constant. Our techniques also show that random linear codes are list-decodable up to (the alphabet-independent) capacity with optimal list-size $O(1/\varepsilon)$ and near-optimal alphabet size $2^{O(1/\varepsilon^2)}$, where $\varepsilon$ is the gap to capacity. As far as we are aware, list-decoding up to capacity with optimal list-size $O(1/\varepsilon)$ was previously not known to be achievable with any linear code over a constant alphabet size (even non-constructively). Our proofs are based on the ideas of Guo and Zhang, and we additionally exploit symmetries of reduced intersection matrices.
♻ ☆ Gaussian Cooling and Dikin Walks: The Interior-Point Method for Logconcave Sampling
The connections between (convex) optimization and (logconcave) sampling have been considerably enriched in the past decade with many conceptual and mathematical analogies. For instance, the Langevin algorithm can be viewed as a sampling analogue of gradient descent and has condition-number-dependent guarantees on its performance. In the early 1990s, Nesterov and Nemirovski developed the Interior-Point Method (IPM) for convex optimization based on self-concordant barriers, providing efficient algorithms for structured convex optimization, often faster than the general method. This raises the following question: can we develop an analogous IPM for structured sampling problems? In 2012, Kannan and Narayanan proposed the Dikin walk for uniformly sampling polytopes, and an improved analysis was given in 2020 by Laddha-Lee-Vempala. The Dikin walk uses a local metric defined by a self-concordant barrier for linear constraints. Here we generalize this approach by developing and adapting IPM machinery together with the Dikin walk for poly-time sampling algorithms. Our IPM-based sampling framework provides an efficient warm start and goes beyond uniform distributions and linear constraints. We illustrate the approach on important special cases, in particular giving the fastest algorithms to sample uniform, exponential, or Gaussian distributions on a truncated PSD cone. The framework is general and can be applied to other sampling algorithms.
comment: Improved writing with minor errors fixed
Signal Processing
☆ Optimization for MIMO Integrated Sensing and Communications
The fundamentals of MIMO communications and MIMO sensing are firstly analyzed with regard to channel and sensing capacities. It is shown that the different objectives of communications and sensing lead to different signaling waveforms required for achieving their capacities. Hence, the optimization of integrated sensing and communications (ISAC) is relied on a trade-off expected between the performance of communications and that of sensing. Following this observation, the design and resource optimization in general MIMO ISAC systems are discussed along with the analysis of some existing ISAC schemes. Furthermore, the design of ISAC in mmWave communications is addressed. Specifically, the principle of sensing in mmWave systems is established, and a range of optimization alternatives for ISAC design in mmWave systems are reviewed.
comment: Book chapter, 22 pages, 4 figures
☆ Bistatic Doppler Frequency Estimation with Asynchronous Moving Devices for Integrated Sensing and Communications
In this letter, we present for the first time a method to estimate the bistatic Doppler frequency of a target with clock asynchronous and mobile Integrated Sensing And Communication (ISAC) devices. Existing approaches have separately tackled the presence of phase offsets due to clock asynchrony or the additional Doppler shift due to device movement. However, in real ISAC scenarios, these two sources of phase nuisance are concurrently present, making the estimation of the target's Doppler frequency particularly challenging. Our method solves the problem using the sole wireless signal at the receiver, by computing Channel Impulse Response (CIR) phase differences across different multipath components and subsequent time instants. In this way, we cancel out phase offsets. Then, we construct a system of equations that allows disentangling the target's Doppler frequency from that of the moving device. The proposed method is validated via simulation, exploring the impact of different system parameters. Numerical results show that our approach is a viable way of estimating Doppler frequency in bistatic asynchronous ISAC scenarios with mobile devices.
☆ An Efficient Rate Splitting Precoding Approach in Multi-User MISO FDD Systems
In this work, we develop an efficient precoding strategy for a multi-user multiple-input-single output (MU MISO) system operating in frequency-division-duplex (FDD) mode, where rate splitting multiple access (RSMA) is implemented. To this end, we consider one-layer RS and show its significant impact on the system performance, specifically in the case where the channel state information (CSI) is incomplete at the transmitter. Based on a lower bound on the achievable rate that takes into account the CSI errors, we establish an augmented weighted average mean squared error (AWAMSE) algorithm for the RS setup denoted by AWAMSE-RS, where even the updates for the common and the private precoders are computed via analytical expressions, hence circumventing the need for interior-point methods. Simulation results validate the efficiency of our approach in terms of computational time and its competitiveness in terms of the achievable system throughput compared to state-of-the-art methods and non-RS setups.
☆ Modem Optimization of High-Mobility Scenarios: A Deep-Learning-Inspired Approach
The next generation wireless communication networks are required to support high-mobility scenarios, such as reliable data transmission for high-speed railways. Nevertheless, widely utilized multi-carrier modulation, the orthogonal frequency division multiplex (OFDM), cannot deal with the severe Doppler spread brought by high mobility. To address this problem, some new modulation schemes, e.g. orthogonal time frequency space and affine frequency division multiplexing, have been proposed with different design criteria from OFDM, which promote reliability with the cost of extremely high implementation complexity. On the other hand, end-to-end systems achieve excellent gains by exploiting neural networks to replace traditional transmitters and receivers, but have to retrain and update continually with channel varying. In this paper, we propose the Modem Network (ModNet) to design a novel modem scheme. Compared with end-to-end systems, channels are directly fed into the network and we can directly get a modem scheme through ModNet. Then, the Tri-Phase training strategy is proposed, which mainly utilizes the siamese structure to unify the learned modem scheme without retraining frequently faced up with time-varying channels. Simulation results show the proposed modem scheme outperforms OFDM systems under different highmobility channel statistics.
comment: 6 pages, 4 figures, accepted by ICC 2024 Workshop - APATN
☆ Learning-to-Learn the Wave Angle Estimation
A precise incident wave angle estimation in aerial communication is a key enabler in sixth-generation wireless communication network. With this goal, a generic 3-dimensional (3D) channel model is analyzed for air-to-air (A2A) networks under antenna misalignment, radio frequency impairments and polarization loss. The unique aspects of each aerial node are highlighted and the few-shot learning as a model agnostic meta-learning (MAML) classifier is proposed for learning-to-learn (L2L) incident wave angle estimation by utilizing the received signal strength (RSS). Additionally, a more computationally efficient technique, first order model agnostic meta-learning (FOMAML) is implemented. It has been observed that the proposed approach reaches up to 85% training accuracy and 75.4% evaluation accuracy with MAML. Regarding this, a convergence rate and accuracy trade-off have been established for several cases of MAML and FOMAML. For different L2L models trained with limited data, heuristic accuracy performance is determined by an upper bound of the probability of confidence.
comment: 18 pages, 9 figures
☆ Fundamentals of Delay-Doppler Communications: Practical Implementation and Extensions to OTFS
The recently proposed orthogonal time frequency space (OTFS) modulation, which is a typical Delay-Doppler (DD) communication scheme, has attracted significant attention thanks to its appealing performance over doubly-selective channels. In this paper, we present the fundamentals of general DD communications from the viewpoint of the Zak transform. We start our study by constructing DD domain basis functions aligning with the time-frequency (TF)-consistency condition, which are globally quasi-periodic and locally twisted-shifted. We unveil that these features are translated to unique signal structures in both time and frequency, which are beneficial for communication purposes. Then, we focus on the practical implementations of DD Nyquist communications, where we show that rectangular windows achieve perfect DD orthogonality, while truncated periodic signals can obtain sufficient DD orthogonality. Particularly, smoothed rectangular window with excess bandwidth can result in a slightly worse orthogonality but better pulse localization in the DD domain. Furthermore, we present a practical pulse shaping framework for general DD communications and derive the corresponding input-output relation under various shaping pulses. Our numerical results agree with our derivations and also demonstrate advantages of DD communications over conventional orthogonal frequency-division multiplexing (OFDM).
☆ A LiDAR-Aided Channel Model for Vehicular Intelligent Sensing-Communication Integration
In this paper, a novel channel modeling approach, named light detection and ranging (LiDAR)-aided geometry-based stochastic modeling (LA-GBSM), is developed. Based on the developed LA-GBSM approach, a new millimeter wave (mmWave) channel model for sixth-generation (6G) vehicular intelligent sensing-communication integration is proposed, which can support the design of intelligent transportation systems (ITSs). The proposed LA-GBSM is accurately parameterized under high, medium, and low vehicular traffic density (VTD) conditions via a sensing-communication simulation dataset with LiDAR point clouds and scatterer information for the first time. Specifically, by detecting dynamic vehicles and static building/tress through LiDAR point clouds via machine learning, scatterers are divided into static and dynamic scatterers. Furthermore, statistical distributions of parameters, e.g., distance, angle, number, and power, related to static and dynamic scatterers are quantified under high, medium, and low VTD conditions. To mimic channel non-stationarity and consistency, based on the quantified statistical distributions, a new visibility region (VR)-based algorithm in consideration of newly generated static/dynamic scatterers is developed. Key channel statistics are derived and simulated. By comparing simulation results and ray-tracing (RT)-based results, the utility of the proposed LA-GBSM is verified.
☆ Adaptive Target Detection for FDA-MIMO Radar with Training Data in Gaussian noise
This paper addresses the problem of detecting a moving target embedded in Gaussian noise with an unknown covariance matrix for frequency diverse array multiple-input multiple-output (FDA-MIMO) radar. To end it, assume that obtaining a set of training data is available. Moreover, we propose three adaptive detectors in accordance with the one-step generalized likelihood ratio test (GLRT), two-step GLRT, and Rao criteria, namely OGLRT, TGLRT, and Rao. The LH adaptive matched filter (LHAMF) detector is also introduced when decomposing the Rao test. Next, all provided detectors have constant false alarm rate (CFAR) properties against the covariance matrix. Besides, the closed-form expressions for false alarm probability (PFA) and detection probability (PD) are derived. Finally, this paper substantiates the correctness of the aforementioned algorithms through numerical simulations.
☆ Sub-Nyquist Sampling OFDM Radar With a Time-Frequency Phase-Coded Waveform
This paper presents a time-frequency phase-coded sub-Nyquist sampling orthogonal frequency division multiplexing (PC-SNS-OFDM) radar system to reduce the analog-to-digital converter (ADC) sampling rate without any additional hardware or signal processing. The proposed radar divides the transmitted OFDM signal into multiple sub-bands along the frequency axis and provides orthogonality to these sub-bands by multiplying phase codes in both the time and frequency domains. Although the sampling rate is reduced by the factor of the number of sub-bands, the sub-bands above the sampling rate are folded into the lowest one due to aliasing. In the process of restoring the signals in folded sub-bands to those in full signal bands, the proposed PC-SNS-OFDM radar effectively eliminates symbol-mismatch noise while introducing trade-offs in the range and Doppler ambiguities. The utilization of phase codes in both the frequency and time domains provides flexible control of the range and Doppler ambiguities. It also improves the signal-to-noise ratio (SNR) of detected targets compared to an earlier sub-Nyquist sampling OFDM radar system. This is validated with simulations and experiments under various sub-Nyquist sampling rates.
☆ Advancing IIoT with Over-the-Air Federated Learning: The Role of Iterative Magnitude Pruning
The industrial Internet of Things (IIoT) under Industry 4.0 heralds an era of interconnected smart devices where data-driven insights and machine learning (ML) fuse to revolutionize manufacturing. A noteworthy development in IIoT is the integration of federated learning (FL), which addresses data privacy and security among devices. FL enables edge sensors, also known as peripheral intelligence units (PIUs) to learn and adapt using their data locally, without explicit sharing of confidential data, to facilitate a collaborative yet confidential learning process. However, the lower memory footprint and computational power of PIUs inherently require deep neural network (DNN) models that have a very compact size. Model compression techniques such as pruning can be used to reduce the size of DNN models by removing unnecessary connections that have little impact on the model's performance, thus making the models more suitable for the limited resources of PIUs. Targeting the notion of compact yet robust DNN models, we propose the integration of iterative magnitude pruning (IMP) of the DNN model being trained in an over-the-air FL (OTA-FL) environment for IIoT. We provide a tutorial overview and also present a case study of the effectiveness of IMP in OTA-FL for an IIoT environment. Finally, we present future directions for enhancing and optimizing these deep compression techniques further, aiming to push the boundaries of IIoT capabilities in acquiring compact yet robust and high-performing DNN models.
comment: 6 pages, 6 figures
☆ Crowdsourced Multilingual Speech Intelligibility Testing
With the advent of generative audio features, there is an increasing need for rapid evaluation of their impact on speech intelligibility. Beyond the existing laboratory measures, which are expensive and do not scale well, there has been comparatively little work on crowdsourced assessment of intelligibility. Standards and recommendations are yet to be defined, and publicly available multilingual test materials are lacking. In response to this challenge, we propose an approach for a crowdsourced intelligibility assessment. We detail the test design, the collection and public release of the multilingual speech data, and the results of our early experiments.
☆ RIS-Aided Cooperative Mobile Edge Computing: Computation Efficiency Maximization via Joint Uplink and Downlink Resource Allocation
In mobile edge computing (MEC) systems, the wireless channel condition is a critical factor affecting both the communication power consumption and computation rate of the offloading tasks. This paper exploits the idea of cooperative transmission and employing reconfigurable intelligent surface (RIS) in MEC to improve the channel condition and maximize computation efficiency (CE). The resulting problem couples various wireless resources in both uplink and downlink, which calls for the joint design of the user association, receive/downlink beamforming vectors, transmit power of users, task partition strategies for local computing and offloading, and uplink/downlink phase shifts at the RIS. To tackle the challenges brought by the combinatorial optimization problem, the group sparsity structure of the beamforming vectors determined by user association is exploited. Furthermore, while the CE does not explicitly depend on the downlink phase shifts, instead of simply finding a feasible solution, we exploit the hidden relationship between them and convert this relationship into an explicit form for optimization. Then the resulting problem is solved via the alternating maximization framework, and the nonconvexity of each subproblem is handled individually. Simulation results show that cooperative transmission and RIS deployment can significantly improve the CE and demonstrate the importance of optimizing the downlink phase shifts with an explicit form.
comment: This paper has been accepted for publication in IEEE Transactions on Wireless Communications
☆ Improving Galileo OSNMA Time To First Authenticated Fix
Galileo is the first global navigation satellite system to authenticate their civilian signals through the Open Service Galileo Message Authentication (OSNMA) protocol. However, OSNMA delays the time to obtain a first position and time fix, the so-called Time To First Authentication Fix (TTFAF). Reducing the TTFAF as much as possible is crucial to integrate the technology seamlessly into the current products. In the cases where the receiver already has cryptographic data available, the so-called hot start mode and focus of this article, the currently available implementations achieve an average TTFAF of around 100 seconds in ideal environments. In this work, we dissect the TTFAF process, propose two main optimizations to reduce the TTFAF, and benchmark them in three distinct scenarios (open-sky, soft urban, and hard urban) with recorded real data. Moreover, we evaluate the optimizations using the synthetic scenario from the official OSNMA test vectors. The first block of optimizations centers on extracting as much information as possible from broken sub-frames by processing them at page level and combining redundant data from multiple satellites. The second block of optimizations aims to reconstruct missed navigation data by using fields in the authentication tags belonging to the same sub-frame as the authentication key. Combining both optimizations improves the TTFAF substantially for all considered scenarios. We obtain an average TTFAF of 60.9 and 68.8 seconds for the test vectors and the open-sky scenario, respectively, with a best-case of 44.0 seconds in both. Likewise, the urban scenarios see a drastic reduction of the average TTFAF between the non-optimized and optimized cases, from 127.5 to 87.5 seconds in the soft urban scenario and from 266.1 to 146.1 seconds in the hard urban scenario. These optimizations are available as part of the open-source OSNMAlib library on GitHub.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. 12 pages, 15 figures
☆ A task of anomaly detection for a smart satellite Internet of things system
When the equipment is working, real-time collection of environmental sensor data for anomaly detection is one of the key links to prevent industrial process accidents and network attacks and ensure system security. However, under the environment with specific real-time requirements, the anomaly detection for environmental sensors still faces the following difficulties: (1) The complex nonlinear correlation characteristics between environmental sensor data variables lack effective expression methods, and the distribution between the data is difficult to be captured. (2) it is difficult to ensure the real-time monitoring requirements by using complex machine learning models, and the equipment cost is too high. (3) Too little sample data leads to less labeled data in supervised learning. This paper proposes an unsupervised deep learning anomaly detection system. Based on the generative adversarial network and self-attention mechanism, considering the different feature information contained in the local subsequences, it automatically learns the complex linear and nonlinear dependencies between environmental sensor variables, and uses the anomaly score calculation method combining reconstruction error and discrimination error. It can monitor the abnormal points of real sensor data with high real-time performance and can run on the intelligent satellite Internet of things system, which is suitable for the real working environment. Anomaly detection outperforms baseline methods in most cases and has good interpretability, which can be used to prevent industrial accidents and cyber-attacks for monitoring environmental sensors.
☆ Multiple and Gyro-Free Inertial Datasets
An inertial navigation system (INS) utilizes three orthogonal accelerometers and gyroscopes to determine platform position, velocity, and orientation. There are countless applications for INS, including robotics, autonomous platforms, and the internet of things. Recent research explores the integration of data-driven methods with INS, highlighting significant innovations, improving accuracy and efficiency. Despite the growing interest in this field and the availability of INS datasets, no datasets are available for gyro-free INS (GFINS) and multiple inertial measurement unit (MIMU) architectures. To fill this gap and to stimulate further research in this field, we designed and recorded GFINS and MIMU datasets using 54 inertial sensors grouped in nine inertial measurement units. These sensors can be used to define and evaluate different types of MIMU and GFINS architectures. The inertial sensors were arranged in three different sensor configurations and mounted on a mobile robot and a passenger car. In total, the dataset contains 35 hours of inertial data and corresponding ground truth trajectories. The data and code are freely accessible through our GitHub repository.
comment: 10 pages, 16 figures, 6 tables
☆ EEG decoding with conditional identification information SP
Decoding EEG signals is crucial for unraveling human brain and advancing brain-computer interfaces. Traditional machine learning algorithms have been hindered by the high noise levels and inherent inter-person variations in EEG signals. Recent advances in deep neural networks (DNNs) have shown promise, owing to their advanced nonlinear modeling capabilities. However, DNN still faces challenge in decoding EEG samples of unseen individuals. To address this, this paper introduces a novel approach by incorporating the conditional identification information of each individual into the neural network, thereby enhancing model representation through the synergistic interaction of EEG and personal traits. We test our model on the WithMe dataset and demonstrated that the inclusion of these identifiers substantially boosts accuracy for both subjects in the training set and unseen subjects. This enhancement suggests promising potential for improving for EEG interpretability and understanding of relevant identification features.
comment: Accepted by 6th International Conference on Advances in Signal Processing and Artificial Intelligence (ASPAI' 2024)
☆ Rolling bearing fault diagnosis method based on generative adversarial enhanced multi-scale convolutional neural network model
In order to solve the problem that current convolutional neural networks can not capture the correlation features between the time domain signals of rolling bearings effectively, and the model accuracy is limited by the number and quality of samples, a rolling bearing fault diagnosis method based on generative adversarial enhanced multi-scale convolutional neural network model is proposed. Firstly, Gram angular field coding technique is used to encode the time domain signal of the rolling bearing and generate the feature map to retain the complete information of the vibration signal. Then, the re-sulting data is divided into a training set, a validation set, and a test set. Among them, the training set is input into the gradient penalty Wasserstein distance generation adversarial network to complete the training, and a new sample with similar features to the training sample is obtained, and then the original training set is expanded. Next, multi-scale convolution is used to extract the fault features of the extended training set, and the feature graph is normalized by example to overcome the influence of the difference in feature distribution. Finally, the attention mechanism is applied to the adaptive weighting of normalized features and the extraction of deep features, and the fault diagnosis is completed by the softmax classifier. Compared with ResNet method, the experimental results show that the proposed method has better generalization performance and anti-noise performance.
♻ ☆ Constellation Shaping under Phase Noise Impairment for Sub-THz Communications
The large untapped spectrum in the sub-THz allows for ultra-high throughput communication to realize many seemingly impossible applications in 6G. One of the challenges in radio communications in sub-THz is the hardware impairments. Specifically, phase noise is one key hardware impairment, which is accentuated as we increase the frequency and bandwidth. Furthermore, the moderate output power of the sub-THz power amplifier demands limits on peak to average power ratio (PAPR) signal design. Single carrier frequency domain equalization (SC-FDE) has been identified as a suitable candidate for sub-THz, although some challenges such as phase noise and PAPR still remain to be tackled. In this work, we design a phase noise robust, modest PAPR SC waveform by geometrically shaping the constellation under practical conditions. We formulate the waveform optimization problem in its augmented Lagrangian form and use a back-propagation-inspired technique to obtain a constellation design that is numerically robust to phase noise, while maintaining a relatively low PAPR compared to the conventional waveforms.
comment: To appear in IEEE ICC 2024
♻ ☆ Event Encryption: Rethinking Privacy Exposure for Neuromorphic Imaging
Bio-inspired neuromorphic cameras sense illumination changes on a per-pixel basis and generate spatiotemporal streaming events within microseconds in response, offering visual information with high temporal resolution over a high dynamic range. Such devices often serve in surveillance systems due to their applicability and robustness in environments with high dynamics and harsh lighting, where they can still supply clearer recordings than traditional imaging. In other words, when it comes to privacy-relevant cases, neuromorphic cameras also expose more sensitive data and pose serious security threats. Therefore, asynchronous event streams necessitate careful encryption before transmission and usage. This work discusses several potential attack scenarios and approaches event encryption from the perspective of neuromorphic noise removal, in which we inversely introduce well-crafted noise into raw events until they are obfuscated. Our evaluations show that the encrypted events can effectively protect information from attacks of low-level visual reconstruction and high-level neuromorphic reasoning, and thus feature dependable privacy-preserving competence. The proposed solution gives impetus to the security of event data and paves the way to a highly encrypted technique for privacy-protective neuromorphic imaging.
comment: 6 pages, 6 figures, and 2 tables. Accepted by IOP Neuromorphic Computing and Engineering
♻ ☆ Real-time Graph Building on FPGAs for Machine Learning Trigger Applications in Particle Physics
We present a design methodology that enables the semi-automatic generation of a hardware-accelerated graph building architectures for locally constrained graphs based on formally described detector definitions. In addition, we define a similarity measure in order to compare our locally constrained graph building approaches with commonly used k-nearest neighbour building approaches. To demonstrate the feasibility of our solution for particle physics applications, we implemented a real-time graph building approach in a case study for the Belle~II central drift chamber using Field-Programmable Gate Arrays~(FPGAs). Our presented solution adheres to all throughput and latency constraints currently present in the hardware-based trigger of the Belle~II experiment. We achieve constant time complexity at the expense of linear space complexity and thus prove that our automated methodology generates online graph building designs suitable for a wide range of particle physics applications. By enabling an hardware-accelerated pre-processing of graphs, we enable the deployment of novel Graph Neural Networks~(GNNs) in first level triggers of particle physics experiments.
comment: 20 pages
♻ ☆ Exploring Hybrid Active-Passive RIS-Aided MEC Systems: From the Mode-Switching Perspective
Mobile edge computing (MEC) has been regarded as a promising technique to support latencysensitivity and computation-intensive serves. However, the low offloading rate caused by the random channel fading characteristic becomes a major bottleneck in restricting the performance of the MEC. Fortunately, reconfigurable intelligent surface (RIS) can alleviate this problem since it can boost both the spectrum- and energy- efficiency. Different from the existing works adopting either fully active or fully passive RIS, we propose a novel hybrid RIS in which reflecting units can flexibly switch between active and passive modes. To achieve a tradeoff between the latency and energy consumption, an optimization problem is formulated by minimizing the total cost. In light of the intractability of the problem, we develop an alternating optimization-based iterative algorithm by combining the successive convex approximation method, the variable substitution, and the singular value decomposition (SVD) to obtain sub-optimal solutions. Furthermore, in order to gain more insight into the problem, we consider two special cases involving a latency minimization problem and an energy consumption minimization problem, and respectively analyze the tradeoff between the number of active and passive units. Simulation results verify that the proposed algorithm can achieve flexible mode switching and significantly outperforms existing algorithms.
♻ ☆ Time-Synchronized Full System State Estimation Considering Practical Implementation Challenges
As the phasor measurement unit (PMU) placement problem involves a cost-benefit trade-off, more PMUs get placed on the higher voltage buses. However, this causes many of the lower voltage levels of the bulk power system to not be observed by PMUs. This lack of visibility then makes time-synchronized state estimation of the full system a challenging problem. We propose a Deep Neural network-based State Estimator (DeNSE) to overcome this problem. The DeNSE employs a Bayesian framework to indirectly combine inferences drawn from slow timescale but widespread supervisory control and data acquisition (SCADA) data with fast timescale but select PMU data to attain sub-second situational awareness of the entire system. The practical utility of the proposed approach is demonstrated by considering topology changes, non-Gaussian measurement noise, and bad data detection and correction. The results obtained using the IEEE 118-bus system show the superiority of the DeNSE over a purely SCADA state estimator and a PMU-only linear state estimator from a techno-economic viability perspective. Lastly, scalability of the DeNSE is proven by estimating the states of a large and realistic 2000-bus Synthetic Texas system.
Sound
☆ XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
☆ Exploring Green AI for Audio Deepfake Detection
The state-of-the-art audio deepfake detectors leveraging deep neural networks exhibit impressive recognition performance. Nonetheless, this advantage is accompanied by a significant carbon footprint. This is mainly due to the use of high-performance computing with accelerators and high training time. Studies show that average deep NLP model produces around 626k lbs of CO\textsubscript{2} which is equivalent to five times of average US car emission at its lifetime. This is certainly a massive threat to the environment. To tackle this challenge, this study presents a novel framework for audio deepfake detection that can be seamlessly trained using standard CPU resources. Our proposed framework utilizes off-the-shelve self-supervised learning (SSL) based models which are pre-trained and available in public repositories. In contrast to existing methods that fine-tune SSL models and employ additional deep neural networks for downstream tasks, we exploit classical machine learning algorithms such as logistic regression and shallow neural networks using the SSL embeddings extracted using the pre-trained model. Our approach shows competitive results compared to the commonly used high-carbon footprint approaches. In experiments with the ASVspoof 2019 LA dataset, we achieve a 0.90\% equal error rate (EER) with less than 1k trainable model parameters. To encourage further research in this direction and support reproducible results, the Python code will be made publicly accessible following acceptance. Github: https://github.com/sahasubhajit/Speech-Spoofing-
comment: This manuscript is under review in a conference
☆ Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization
Clustering speaker embeddings is crucial in speaker diarization but hasn't received as much focus as other components. Moreover, the robustness of speaker diarization across various datasets hasn't been explored when the development and evaluation data are from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization. Our extensive experiments on two widely used corpora, AMI and DIHARD, reveal the performance trend of speaker diarization in the presence of domain mismatch. We observe that the performance difference between two different domain conditions can be attributed to the role of spectral clustering. In particular, keeping other modules unchanged, we show that differences in optimal tuning parameters as well as speaker count estimation originates due to the mismatch. This study opens several future directions for speaker diarization research.
comment: Manuscript Under Review
☆ Speech-Aware Neural Diarization with Encoder-Decoder Attractor Guided by Attention Constraints TAAI
End-to-End Neural Diarization with Encoder-Decoder based Attractor (EEND-EDA) is an end-to-end neural model for automatic speaker segmentation and labeling. It achieves the capability to handle flexible number of speakers by estimating the number of attractors. EEND-EDA, however, struggles to accurately capture local speaker dynamics. This work proposes an auxiliary loss that aims to guide the Transformer encoders at the lower layer of EEND-EDA model to enhance the effect of self-attention modules using speaker activity information. The results evaluated on public dataset Mini LibriSpeech, demonstrates the effectiveness of the work, reducing Diarization Error Rate from 30.95% to 28.17%. We will release the source code on GitHub to allow further research and reproducibility.
comment: Accepted to The 28th International Conference on Technologies and Applications of Artificial Intelligence (TAAI), in Chinese language
☆ AdaProj: Adaptively Scaled Angular Margin Subspace Projections for Anomalous Sound Detection with Auxiliary Classification Tasks
The state-of-the-art approach for semi-supervised anomalous sound detection is to first learn an embedding space by using auxiliary classification tasks based on meta information or self-supervised learning and then estimate the distribution of normal data. In this work, AdaProj a novel loss function is presented. In contrast to commonly used angular margin losses, which project data of each class as close as possible to their corresponding class centers, AdaProj learns to project data onto class-specific subspaces. By doing so, the resulting distributions of embeddings belonging to normal data are not required to be as restrictive as other loss functions allowing a more detailed view on the data. In experiments conducted on the DCASE2022 and DCASE2023 datasets, it is shown that using AdaProj to learn an embedding space significantly outperforms other commonly used loss functions and results in a state-of-the-art performance on the DCASE2023 dataset.
☆ emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition
Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a potential solution for automatically determining the best DL model. The Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports the selection of CNN and LSTM coupling to improve performance. While DARTS has previously been used to choose CNN and LSTM operations independently, our technique adds a novel mechanism for selecting CNN and SeqNN operations in conjunction using DARTS. Unlike earlier work, we do not impose limits on the layer order of the CNN. Instead, we let DARTS choose the best layer order inside the DARTS cell. We demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM by evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets.
comment: Submitted to IEEE Transactions on Affective Computing on February 19, 2024. arXiv admin note: text overlap with arXiv:2305.14402
☆ The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data
The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains. There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared to other ML areas, e.g., computer vision or natural language processing. A major limitation with audio is the available data; with audio being a time-dependent modality, high-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to a larger, more generalizable dataset. In this short white paper, to encourage researchers with limited access to large-datasets, the organizers first outline several open-source datasets that are available to the community, and for the duration of the workshop are making several propriety datasets available. Namely, three vocal datasets, Hume-Prosody, Hume-VocalBurst, an acted emotional speech dataset Modulate-Sonata, and an in-game streamer dataset Modulate-Stream. We outline the current baselines on these datasets but encourage researchers from across audio to utilize them outside of the initial baseline tasks.
♻ ☆ Unimodal Multi-Task Fusion for Emotional Mimicry Prediciton
In this study, we propose a methodology for the Emotional Mimicry Intensity (EMI) Estimation task within the context of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our approach leverages the Wav2Vec 2.0 framework, pre-trained on a comprehensive podcast dataset, to extract a broad range of audio features encompassing both linguistic and paralinguistic elements. We enhance feature representation through a fusion technique that integrates individual features with a global mean vector, introducing global contextual insights into our analysis. Additionally, we incorporate a pre-trained valence-arousal-dominance (VAD) module from the Wav2Vec 2.0 model. Our fusion employs a Long Short-Term Memory (LSTM) architecture for efficient temporal analysis of audio data. Utilizing only the provided audio data, our approach demonstrates significant improvements over the established baseline.
♻ ☆ RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation ICLR
Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the prior SOTA method in both inference speed and separation quality while reducing the number of parameters by 90% and MACs by 83%. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
comment: Accepted by The Twelfth International Conference on Learning Representations (ICLR) 2024, see https://openreview.net/forum?id=PEuDO2EiDr
♻ ☆ CryCeleb: A Speaker Verification Dataset Based on Infant Cry Sounds ICASSP 2024
This paper describes the Ubenwa CryCeleb dataset - a labeled collection of infant cries - and the accompanying CryCeleb 2023 task, which is a public speaker verification challenge based on cry sounds. We released more than 6 hours of manually segmented cry sounds from 786 newborns for academic use, aiming to encourage research in infant cry analysis. The inaugural public competition attracted 59 participants, 11 of whom improved the baseline performance. The top-performing system achieved a significant improvement scoring 25.8% equal error rate, which is still far from the performance of state-of-the-art adult speaker verification systems. Therefore, we believe there is room for further research on this dataset, potentially extending beyond the verification task.
comment: ICASSP 2024
♻ ☆ Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.
Robotics
☆ ODTFormer: Efficient Obstacle Detection and Tracking with Stereo Cameras Based on Transformer
Obstacle detection and tracking represent a critical component in robot autonomous navigation. In this paper, we propose ODTFormer, a Transformer-based model to address both obstacle detection and tracking problems. For the detection task, our approach leverages deformable attention to construct a 3D cost volume, which is decoded progressively in the form of voxel occupancy grids. We further track the obstacles by matching the voxels between consecutive frames. The entire model can be optimized in an end-to-end manner. Through extensive experiments on DrivingStereo and KITTI benchmarks, our model achieves state-of-the-art performance in the obstacle detection task. We also report comparable accuracy to state-of-the-art obstacle tracking models while requiring only a fraction of their computation cost, typically ten-fold to twenty-fold less. The code and model weights will be publicly released.
comment: 8 pages
☆ SDP Synthesis of Maximum Coverage Trees for Probabilistic Planning under Control Constraints
The paper presents Maximal Covariance Backward Reachable Trees (MAXCOVAR BRT), which is a multi-query algorithm for planning of dynamic systems under stochastic motion uncertainty and constraints on the control input with explicit coverage guarantees. In contrast to existing roadmap-based probabilistic planning methods that sample belief nodes randomly and draw edges between them \cite{csbrm_tro2024}, under control constraints, the reachability of belief nodes needs to be explicitly established and is determined by checking the feasibility of a non-convex program. Moreover, there is no explicit consideration of coverage of the roadmap while adding nodes and edges during the construction procedure for the existing methods. Our contribution is a novel optimization formulation to add nodes and construct the corresponding edge controllers such that the generated roadmap results in provably maximal coverage under control constraints as compared to any other method of adding nodes and edges. We characterize formally the notion of coverage of a roadmap in this stochastic domain via introduction of the h-$\operatorname{BRS}$ (Backward Reachable Set of Distributions) of a tree of distributions under control constraints, and also support our method with extensive simulations on a 6 DoF model.
☆ Extended Reality for Enhanced Human-Robot Collaboration: a Human-in-the-Loop Approach
The rise of automation has provided an opportunity to achieve higher efficiency in manufacturing processes, yet it often compromises the flexibility required to promptly respond to evolving market needs and meet the demand for customization. Human-robot collaboration attempts to tackle these challenges by combining the strength and precision of machines with human ingenuity and perceptual understanding. In this paper, we conceptualize and propose an implementation framework for an autonomous, machine learning-based manipulator that incorporates human-in-the-loop principles and leverages Extended Reality (XR) to facilitate intuitive communication and programming between humans and robots. Furthermore, the conceptual framework foresees human involvement directly in the robot learning process, resulting in higher adaptability and task generalization. The paper highlights key technologies enabling the proposed framework, emphasizing the importance of developing the digital ecosystem as a whole. Additionally, we review the existent implementation approaches of XR in human-robot collaboration, showcasing diverse perspectives and methodologies. The challenges and future outlooks are discussed, delving into the major obstacles and potential research avenues of XR for more natural human-robot interaction and integration in the industrial landscape.
☆ VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition
Recent works on the global place recognition treat the task as a retrieval problem, where an off-the-shelf global descriptor is commonly designed in image-based and LiDAR-based modalities. However, it is non-trivial to perform accurate image-LiDAR global place recognition since extracting consistent and robust global descriptors from different domains (2D images and 3D point clouds) is challenging. To address this issue, we propose a novel Voxel-Cross-Pixel (VXP) approach, which establishes voxel and pixel correspondences in a self-supervised manner and brings them into a shared feature space. Specifically, VXP is trained in a two-stage manner that first explicitly exploits local feature correspondences and enforces similarity of global descriptors. Extensive experiments on the three benchmarks (Oxford RobotCar, ViViD++ and KITTI) demonstrate our method surpasses the state-of-the-art cross-modal retrieval by a large margin.
comment: Project page https://yunjinli.github.io/projects-vxp/
☆ Co-Optimization of Environment and Policies for Decentralized Multi-Agent Navigation
This work views the multi-agent system and its surrounding environment as a co-evolving system, where the behavior of one affects the other. The goal is to take both agent actions and environment configurations as decision variables, and optimize these two components in a coordinated manner to improve some measure of interest. Towards this end, we consider the problem of decentralized multi-agent navigation in cluttered environments. By introducing two sub-objectives of multi-agent navigation and environment optimization, we propose an $\textit{agent-environment co-optimization}$ problem and develop a $\textit{coordinated algorithm}$ that alternates between these sub-objectives to search for an optimal synthesis of agent actions and obstacle configurations in the environment; ultimately, improving the navigation performance. Due to the challenge of explicitly modeling the relation between agents, environment and performance, we leverage policy gradient to formulate a model-free learning mechanism within the coordinated framework. A formal convergence analysis shows that our coordinated algorithm tracks the local minimum trajectory of an associated time-varying non-convex optimization problem. Extensive numerical results corroborate theoretical findings and show the benefits of co-optimization over baselines. Interestingly, the results also indicate that optimized environment configurations are able to offer structural guidance that is key to de-conflicting agents in motion.
☆ Learning Hierarchical Control For Constrained Dynamic Task Assignment
This paper introduces a novel data-driven hierarchical control scheme for managing a fleet of nonlinear, capacity-constrained autonomous agents in an iterative environment. We propose a control framework consisting of a high-level dynamic task assignment and routing layer and low-level motion planning and tracking layer. Each layer of the control hierarchy uses a data-driven MPC policy, maintaining bounded computational complexity at each calculation of a new task assignment or actuation input. We utilize collected data to iteratively refine estimates of agent capacity usage, and update MPC policy parameters accordingly. Our approach leverages tools from iterative learning control to integrate learning at both levels of the hierarchy, and coordinates learning between levels in order to maintain closed-loop feasibility and performance improvement of the connected architecture.
☆ Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors
Precise manipulation that is generalizable across scenes and objects remains a persistent challenge in robotics. Current approaches for this task heavily depend on having a significant number of training instances to handle objects with pronounced visual and/or geometric part ambiguities. Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting by utilizing web-trained text-to-image diffusion-based generative models. We tackle the problem by framing it as a dense semantic part correspondence task. Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click from a source image of a visually different instance of the same object. We require no manual grasping demonstrations as we leverage the intrinsic object geometry and features. Practical experiments in a real-world tabletop scenario validate the efficacy of our approach, demonstrating its potential for advancing semantic-aware robotics manipulation. Web page: https://tsagkas.github.io/click2grasp
comment: 8 pages, 4 figures
☆ Physics-Based Causal Reasoning for Safe & Robust Next-Best Action Selection in Robot Manipulation Tasks IROS
Safe and efficient object manipulation is a key enabler of many real-world robot applications. However, this is challenging because robot operation must be robust to a range of sensor and actuator uncertainties. In this paper, we present a physics-informed causal-inference-based framework for a robot to probabilistically reason about candidate actions in a block stacking task in a partially observable setting. We integrate a physics-based simulation of the rigid-body system dynamics with a causal Bayesian network (CBN) formulation to define a causal generative probabilistic model of the robot decision-making process. Using simulation-based Monte Carlo experiments, we demonstrate our framework's ability to successfully: (1) predict block tower stability with high accuracy (Pred Acc: 88.6%); and, (2) select an approximate next-best action for the block stacking task, for execution by an integrated robot system, achieving 94.2% task success rate. We also demonstrate our framework's suitability for real-world robot systems by demonstrating successful task executions with a domestic support robot, with perception and manipulation sub-system integration. Hence, we show that by embedding physics-based causal reasoning into robots' decision-making processes, we can make robot task execution safer, more reliable, and more robust to various types of uncertainty.
comment: 8 pages, 9 figures, submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
☆ Bringing Robots Home: The Rise of AI Robots in Consumer Electronics
On March 18, 2024, NVIDIA unveiled Project GR00T, a general-purpose multimodal generative AI model designed specifically for training humanoid robots. Preceding this event, Tesla's unveiling of the Optimus Gen 2 humanoid robot on December 12, 2023, underscored the profound impact robotics is poised to have on reshaping various facets of our daily lives. While robots have long dominated industrial settings, their presence within our homes is a burgeoning phenomenon. This can be attributed, in part, to the complexities of domestic environments and the challenges of creating robots that can seamlessly integrate into our daily routines.
comment: Accepted by IEEE Consumer Electronics Magazine
☆ Exploring 3D Human Pose Estimation and Forecasting from the Robot's Perspective: The HARPER Dataset
We introduce HARPER, a novel dataset for 3D body pose estimation and forecast in dyadic interactions between users and \spot, the quadruped robot manufactured by Boston Dynamics. The key-novelty is the focus on the robot's perspective, i.e., on the data captured by the robot's sensors. These make 3D body pose analysis challenging because being close to the ground captures humans only partially. The scenario underlying HARPER includes 15 actions, of which 10 involve physical contact between the robot and users. The Corpus contains not only the recordings of the built-in stereo cameras of Spot, but also those of a 6-camera OptiTrack system (all recordings are synchronized). This leads to ground-truth skeletal representations with a precision lower than a millimeter. In addition, the Corpus includes reproducible benchmarks on 3D Human Pose Estimation, Human Pose Forecasting, and Collision Prediction, all based on publicly available baseline approaches. This enables future HARPER users to rigorously compare their results with those we provide in this work.
☆ Efficient Model Learning and Adaptive Tracking Control of Magnetic Micro-Robots for Non-Contact Manipulation
Magnetic microrobots can be navigated by an external magnetic field to autonomously move within living organisms with complex and unstructured environments. Potential applications include drug delivery, diagnostics, and therapeutic interventions. Existing techniques commonly impart magnetic properties to the target object,or drive the robot to contact and then manipulate the object, both probably inducing physical damage. This paper considers a non-contact formulation, where the robot spins to generate a repulsive field to push the object without physical contact. Under such a formulation, the main challenge is that the motion model between the input of the magnetic field and the output velocity of the target object is commonly unknown and difficult to analyze. To deal with it, this paper proposes a data-driven-based solution. A neural network is constructed to efficiently estimate the motion model. Then, an approximate model-based optimal control scheme is developed to push the object to track a time-varying trajectory, maintaining the non-contact with distance constraints. Furthermore, a straightforward planner is introduced to assess the adaptability of non-contact manipulation in a cluttered unstructured environment. Experimental results are presented to show the tracking and navigation performance of the proposed scheme.
comment: 7 pages, 6 figures, received by 2024 IEEE International Conference on Robotics and Automation
☆ DaCapo: Accelerating Continuous Learning in Autonomous Systems for Video Analytics
Deep neural network (DNN) video analytics is crucial for autonomous systems such as self-driving vehicles, unmanned aerial vehicles (UAVs), and security robots. However, real-world deployment faces challenges due to their limited computational resources and battery power. To tackle these challenges, continuous learning exploits a lightweight "student" model at deployment (inference), leverages a larger "teacher" model for labeling sampled data (labeling), and continuously retrains the student model to adapt to changing scenarios (retraining). This paper highlights the limitations in state-of-the-art continuous learning systems: (1) they focus on computations for retraining, while overlooking the compute needs for inference and labeling, (2) they rely on power-hungry GPUs, unsuitable for battery-operated autonomous systems, and (3) they are located on a remote centralized server, intended for multi-tenant scenarios, again unsuitable for autonomous systems due to privacy, network availability, and latency concerns. We propose a hardware-algorithm co-designed solution for continuous learning, DaCapo, that enables autonomous systems to perform concurrent executions of inference, labeling, and training in a performant and energy-efficient manner. DaCapo comprises (1) a spatially-partitionable and precision-flexible accelerator enabling parallel execution of kernels on sub-accelerators at their respective precisions, and (2) a spatiotemporal resource allocation algorithm that strategically navigates the resource-accuracy tradeoff space, facilitating optimal decisions for resource allocation to achieve maximal accuracy. Our evaluation shows that DaCapo achieves 6.5% and 5.5% higher accuracy than a state-of-the-art GPU-based continuous learning systems, Ekya and EOMU, respectively, while consuming 254x less power.
☆ A Comparative Study of Real-Time Implementable Cooperative Aerial Manipulation Systems
This survey paper focuses on quadrotor- and multirotor- based cooperative aerial manipulation. Emphasis is first given on comparing and evaluating prototype systems that have been implemented and tested in real-time in diverse application environments. Underlying modeling and control approaches are also discussed and compared. The outcome of the survey allows for understanding the motivation and rationale to develop such systems, their applicability and implementability in diverse applications and also challenges that need to be addressed and overcome. Moreover, the survey provides a guide to develop the next generation of prototype systems based on preferred characteristics, functionality, operability and application domain.
comment: Submitted to MDPI Drones
☆ Tell Me What You Want (What You Really, Really Want): Addressing the Expectation Gap for Goal Conveyance from Humans to Robots
Conveying human goals to autonomous systems (AS) occurs both when the system is being designed and when it is being operated. The design-step conveyance is typically mediated by robotics and AI engineers, who must appropriately capture end-user requirements and concepts of operations, while the operation-step conveyance is mediated by the design, interfaces, and behavior of the AI. However, communication can be difficult during both these periods because of mismatches in the expectations and expertise of the end-user and the roboticist, necessitating more design cycles to resolve. We examine some of the barriers in communicating system design requirements, and develop an augmentation for applied cognitive task analysis (ACTA) methods, that we call robot task analysis (RTA), pertaining specifically to the development of autonomous systems. Further, we introduce a top-down view of an underexplored area of friction between requirements communication -- implied human expectations -- utilizing a collection of work primarily from experimental psychology and social sciences. We show how such expectations can be used in conjunction with task-specific expectations and the system design process for AS to improve design team communication, alleviate barriers to user rejection, and reduce the number of design cycles.
comment: Presented at the End-User Development for Human-Robot Interaction (EUD4HRI) workshop at HRI 2024
☆ Distilling Reinforcement Learning Policies for Interpretable Robot Locomotion: Gradient Boosting Machines and Symbolic Regression
Recent advancements in reinforcement learning (RL) have led to remarkable achievements in robot locomotion capabilities. However, the complexity and ``black-box'' nature of neural network-based RL policies hinder their interpretability and broader acceptance, particularly in applications demanding high levels of safety and reliability. This paper introduces a novel approach to distill neural RL policies into more interpretable forms using Gradient Boosting Machines (GBMs), Explainable Boosting Machines (EBMs) and Symbolic Regression. By leveraging the inherent interpretability of generalized additive models, decision trees, and analytical expressions, we transform opaque neural network policies into more transparent ``glass-box'' models. We train expert neural network policies using RL and subsequently distill them into (i) GBMs, (ii) EBMs, and (iii) symbolic policies. To address the inherent distribution shift challenge of behavioral cloning, we propose to use the Dataset Aggregation (DAgger) algorithm with a curriculum of episode-dependent alternation of actions between expert and distilled policies, to enable efficient distillation of feedback control policies. We evaluate our approach on various robot locomotion gaits -- walking, trotting, bounding, and pacing -- and study the importance of different observations in joint actions for distilled policies using various methods. We train neural expert policies for 205 hours of simulated experience and distill interpretable policies with only 10 minutes of simulated interaction for each gait using the proposed method.
☆ Evaluation and Deployment of LiDAR-based Place Recognition in Dense Forests
Many LiDAR place recognition systems have been developed and tested specifically for urban driving scenarios. Their performance in natural environments such as forests and woodlands have been studied less closely. In this paper, we analyzed the capabilities of four different LiDAR place recognition systems, both handcrafted and learning-based methods, using LiDAR data collected with a handheld device and legged robot within dense forest environments. In particular, we focused on evaluating localization where there is significant translational and orientation difference between corresponding LiDAR scan pairs. This is particularly important for forest survey systems where the sensor or robot does not follow a defined road or path. Extending our analysis we then incorporated the best performing approach, Logg3dNet, into a full 6-DoF pose estimation system -- introducing several verification layers for precise registration. We demonstrated the performance of our methods in three operational modes: online SLAM, offline multi-mission SLAM map merging, and relocalization into a prior map. We evaluated these modes using data captured in forests from three different countries, achieving 80% of correct loop closures candidates with baseline distances up to 5m, and 60% up to 10m.
☆ Exosense: A Vision-Centric Scene Understanding System For Safe Exoskeleton Navigation
Exoskeletons for daily use by those with mobility impairments are being developed. They will require accurate and robust scene understanding systems. Current research has used vision to identify immediate terrain and geometric obstacles, however these approaches are constrained to detections directly in front of the user and are limited to classifying a finite range of terrain types (e.g., stairs, ramps and level-ground). This paper presents Exosense, a vision-centric scene understanding system which is capable of generating rich, globally-consistent elevation maps, incorporating both semantic and terrain traversability information. It features an elastic Atlas mapping framework associated with a visual SLAM pose graph, embedded with open-vocabulary room labels from a Vision-Language Model (VLM). The device's design includes a wide field-of-view (FoV) fisheye multi-camera system to mitigate the challenges introduced by the exoskeleton walking pattern. We demonstrate the system's robustness to the challenges of typical periodic walking gaits, and its ability to construct accurate semantically-rich maps in indoor settings. Additionally, we showcase its potential for motion planning -- providing a step towards safe navigation for exoskeletons.
comment: 8 pages, 10 figures
☆ Bayesian Optimization for Sample-Efficient Policy Improvement in Robotic Manipulation IROS2024
Sample efficient learning of manipulation skills poses a major challenge in robotics. While recent approaches demonstrate impressive advances in the type of task that can be addressed and the sensing modalities that can be incorporated, they still require large amounts of training data. Especially with regard to learning actions on robots in the real world, this poses a major problem due to the high costs associated with both demonstrations and real-world robot interactions. To address this challenge, we introduce BOpt-GMM, a hybrid approach that combines imitation learning with own experience collection. We first learn a skill model as a dynamical system encoded in a Gaussian Mixture Model from a few demonstrations. We then improve this model with Bayesian optimization building on a small number of autonomous skill executions in a sparse reward setting. We demonstrate the sample efficiency of our approach on multiple complex manipulation skills in both simulations and real-world experiments. Furthermore, we make the code and pre-trained models publicly available at http://bopt-gmm. cs.uni-freiburg.de.
comment: 7 pages, 5 figures, 2 tables, submitted to IROS2024
☆ DexDribbler: Learning Dexterous Soccer Manipulation via Dynamic Supervision IROS 2024
Learning dexterous locomotion policy for legged robots is becoming increasingly popular due to its ability to handle diverse terrains and resemble intelligent behaviors. However, joint manipulation of moving objects and locomotion with legs, such as playing soccer, receive scant attention in the learning community, although it is natural for humans and smart animals. A key challenge to solve this multitask problem is to infer the objectives of locomotion from the states and targets of the manipulated objects. The implicit relation between the object states and robot locomotion can be hard to capture directly from the training experience. We propose adding a feedback control block to compute the necessary body-level movement accurately and using the outputs as dynamic joint-level locomotion supervision explicitly. We further utilize an improved ball dynamic model, an extended context-aided estimator, and a comprehensive ball observer to facilitate transferring policy learned in simulation to the real world. We observe that our learning scheme can not only make the policy network converge faster but also enable soccer robots to perform sophisticated maneuvers like sharp cuts and turns on flat surfaces, a capability that was lacking in previous methods. Video and code are available at https://github.com/SysCV/soccer-player
comment: 8 pages, 7 figures, submitted to IROS 2024
☆ Human Reactions to Incorrect Answers from Robots
As robots grow more and more integrated into numerous industries, it is critical to comprehend how humans respond to their failures. This paper systematically studies how trust dynamics and system design are affected by human responses to robot failures. The three-stage survey used in the study provides a thorough understanding of human-robot interactions. While the second stage concentrates on interaction details, such as robot precision and error acknowledgment, the first stage collects demographic data and initial levels of trust. In the last phase, participants' perceptions are examined after the encounter, and trust dynamics, forgiveness, and propensity to suggest robotic technologies are evaluated. Results show that participants' trust in robotic technologies increased significantly when robots acknowledged their errors or limitations to participants and their willingness to suggest robots for activities in the future points to a favorable change in perception, emphasizing the role that direct engagement has in influencing trust dynamics. By providing useful advice for creating more sympathetic, responsive, and reliable robotic systems, the study advances the science of human-robot interaction and promotes a wider adoption of robotic technologies.
comment: 6 pages, 6 figures, 1 table, Ro-Man 2024
☆ UAV-Assisted Maritime Search and Rescue: A Holistic Approach
In this paper, we explore the application of Unmanned Aerial Vehicles (UAVs) in maritime search and rescue (mSAR) missions, focusing on medium-sized fixed-wing drones and quadcopters. We address the challenges and limitations inherent in operating some of the different classes of UAVs, particularly in search operations. Our research includes the development of a comprehensive software framework designed to enhance the efficiency and efficacy of SAR operations. This framework combines preliminary detection onboard UAVs with advanced object detection at ground stations, aiming to reduce visual strain and improve decision-making for operators. It will be made publicly available upon publication. We conduct experiments to evaluate various Region of Interest (RoI) proposal methods, especially by imposing simulated limited bandwidth on them, an important consideration when flying remote or offshore operations. This forces the algorithm to prioritize some predictions over others.
☆ Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
☆ ReFeree: Radar-based efficient global descriptor using a Feature and Free space for Place Recognition
Radar is highlighted for robust sensing capabilities in adverse weather conditions (e.g. dense fog, heavy rain, or snowfall). In addition, Radar can cover wide areas and penetrate small particles. Despite these advantages, Radar-based place recognition remains in the early stages compared to other sensors due to its unique characteristics such as low resolution, and significant noise. In this paper, we propose a Radarbased place recognition utilizing a descriptor called ReFeree using a feature and free space. Unlike traditional methods, we overwhelmingly summarize the Radar image. Despite being lightweight, it contains semi-metric information and is also outstanding from the perspective of place recognition performance. For concrete validation, we test a single session from the MulRan dataset and a multi-session from the Oxford Radar RobotCar and the Boreas dataset.
comment: 5 pages, 4 figures
☆ HCTO: Optimality-Aware LiDAR Inertial Odometry with Hybrid Continuous Time Optimization for Compact Wearable Mapping System
Compact wearable mapping system (WMS) has gained significant attention due to their convenience in various applications. Specifically, it provides an efficient way to collect prior maps for 3D structure inspection and robot-based "last-mile delivery" in complex environments. However, vibrations in human motion and the uneven distribution of point cloud features in complex environments often lead to rapid drift, which is a prevalent issue when applying existing LiDAR Inertial Odometry (LIO) methods on low-cost WMS. To address these limitations, we propose a novel LIO for WMSs based on Hybrid Continuous Time Optimization (HCTO) considering the optimality of Lidar correspondences. First, HCTO recognizes patterns in human motion (high-frequency part, low-frequency part, and constant velocity part) by analyzing raw IMU measurements. Second, HCTO constructs hybrid IMU factors according to different motion states, which enables robust and accurate estimation against vibration-induced noise in the IMU measurements. Third, the best point correspondences are selected using optimal design to achieve real-time performance and better odometry accuracy. We conduct experiments on head-mounted WMS datasets to evaluate the performance of our system, demonstrating significant advantages over state-of-the-art methods. Video recordings of experiments can be found on the project page of HCTO: \href{https://github.com/kafeiyin00/HCTO}{https://github.com/kafeiyin00/HCTO}.
☆ Leveraging Large Language Model-based Room-Object Relationships Knowledge for Enhancing Multimodal-Input Object Goal Navigation
Object-goal navigation is a crucial engineering task for the community of embodied navigation; it involves navigating to an instance of a specified object category within unseen environments. Although extensive investigations have been conducted on both end-to-end and modular-based, data-driven approaches, fully enabling an agent to comprehend the environment through perceptual knowledge and perform object-goal navigation as efficiently as humans remains a significant challenge. Recently, large language models have shown potential in this task, thanks to their powerful capabilities for knowledge extraction and integration. In this study, we propose a data-driven, modular-based approach, trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from a large language model. We utilize the multi-channel Swin-Unet architecture to conduct multi-task learning incorporating with multimodal inputs. The results in the Habitat simulator demonstrate that our framework outperforms the baseline by an average of 10.6% in the efficiency metric, Success weighted by Path Length (SPL). The real-world demonstration shows that the proposed approach can efficiently conduct this task by traversing several rooms. For more details and real-world demonstrations, please check our project webpage (https://sunleyuan.github.io/ObjectNav).
comment: will soon submit to the Elsevier journal, Advanced Engineering Informatics
☆ Extrinsic Calibration of Multiple LiDARs for a Mobile Robot based on Floor Plane And Object Segmentation
Mobile robots equipped with multiple light detection and ranging (LiDARs) and capable of recognizing their surroundings are increasing due to the minitualization and cost reduction of LiDAR. This paper proposes a target-less extrinsic calibration method of multiple LiDARs with non-overlapping field of view (FoV). The proposed method uses accumulated point clouds of floor plane and objects while in motion. It enables accurate calibration with challenging configuration of LiDARs that directed towards the floor plane, caused by biased feature values. Additionally, the method includes a noise removal module that considers the scanning pattern to address bleeding points, which are noises of significant source of error in point cloud alignment using high-density LiDARs. Evaluations through simulation demonstrate that the proposed method achieved higher accuracy extrinsic calibration with two and four LiDARs than conventional methods, regardless type of objects. Furthermore, the experiments using a real mobile robot has shown that our proposed noise removal module can eliminate noise more precisely than conventional methods, and the estimated extrinsic parameters have successfully created consistent 3D maps.
comment: 8pages, 10figures
☆ Development of a Compact Robust Passive Transformable Omni-Ball for Enhanced Step-Climbing and Vibration Reduction
This paper introduces the Passive Transformable Omni-Ball (PTOB), an advanced omnidirectional wheel engineered to enhance step-climbing performance, incorporate built-in actuators, diminish vibrations, and fortify structural integrity. By modifying the omni-ball's structure from two to three segments, we have achieved improved in-wheel actuation and a reduction in vibrational feedback. Additionally, we have implemented a sliding mechanism in the follower wheels to boost the wheel's step-climbing abilities. A prototype with a 127 mm diameter PTOB was constructed, which confirmed its functionality for omnidirectional movement and internal actuation. Compared to a traditional omni-wheel, the PTOB demonstrated a comparable level of vibration while offering superior capabilities. Extensive testing in varied settings showed that the PTOB can adeptly handle step obstacles up to 45 mm, equivalent to 35 $\%$ of the wheel's diameter, in both the forward and lateral directions. The PTOB showcased robust construction and proved to be versatile in navigating through environments with diverse obstacles.
comment: 8 pages, 16 figures
☆ Robust Locomotion via Zero-order Stochastic Nonlinear Model Predictive Control with Guard Saltation Matrix
This paper presents a stochastic/robust nonlinear model predictive control (NMPC) to enhance the robustness of legged locomotion against contact uncertainties. We integrate the contact uncertainties into the covariance propagation of stochastic/robust NMPC framework by leveraging the guard saltation matrix and an extended Kalman filter-like covariance update. We achieve fast stochastic/robust NMPC computation by utilizing the zero-order stochastic/robust NMPC algorithm with additional improvements in computational efficiency concerning the feedback gains. We conducted numerical experiments and demonstrate that the proposed method can accurately forecast future state covariance and generate trajectories that satisfies constraints even in the presence of the contact uncertainties. Hardware experiments on the perceptive locomotion of a wheeled-legged robot were also carried out, validating the feasibility of the proposed method in a real-world system with limited on-board computation.
comment: 8 pages, 8 figures
☆ Evidential Semantic Mapping in Off-road Environments with Uncertainty-aware Bayesian Kernel Inference
Robotic mapping with Bayesian Kernel Inference (BKI) has shown promise in creating semantic maps by effectively leveraging local spatial information. However, existing semantic mapping methods face challenges in constructing reliable maps in unstructured outdoor scenarios due to unreliable semantic predictions. To address this issue, we propose an evidential semantic mapping, which can enhance reliability in perceptually challenging off-road environments. We integrate Evidential Deep Learning into the semantic segmentation network to obtain the uncertainty estimate of semantic prediction. Subsequently, this semantic uncertainty is incorporated into an uncertainty-aware BKI, tailored to prioritize more confident semantic predictions when accumulating semantic information. By adaptively handling semantic uncertainties, the proposed framework constructs robust representations of the surroundings even in previously unseen environments. Comprehensive experiments across various off-road datasets demonstrate that our framework enhances accuracy and robustness, consistently outperforming existing methods in scenes with high perceptual uncertainties.
comment: Our project website can be found at https://kjyoung.github.io/Homepage/#/Projects/Evidential-Semantic-Mapping
☆ Semantics from Space: Satellite-Guided Thermal Semantic Segmentation Annotation for Aerial Field Robots
We present a new method to automatically generate semantic segmentation annotations for thermal imagery captured from an aerial vehicle by utilizing satellite-derived data products alongside onboard global positioning and attitude estimates. This new capability overcomes the challenge of developing thermal semantic perception algorithms for field robots due to the lack of annotated thermal field datasets and the time and costs of manual annotation, enabling precise and rapid annotation of thermal data from field collection efforts at a massively-parallelizable scale. By incorporating a thermal-conditioned refinement step with visual foundation models, our approach can produce highly-precise semantic segmentation labels using low-resolution satellite land cover data for little-to-no cost. It achieves 98.5% of the performance from using costly high-resolution options and demonstrates between 70-160% improvement over popular zero-shot semantic segmentation methods based on large vision-language models currently used for generating annotations for RGB imagery. Code will be available at: https://github.com/connorlee77/aerial-auto-segment.
☆ A Roadmap Towards Automated and Regulated Robotic Systems
The rapid development of generative technology opens up possibility for higher level of automation, and artificial intelligence (AI) embodiment in robotic systems is imminent. However, due to the blackbox nature of the generative technology, the generation of the knowledge and workflow scheme is uncontrolled, especially in a dynamic environment and a complex scene. This poses challenges to regulations in safety-demanding applications such as medical scenes. We argue that the unregulated generative processes from AI is fitted for low level end tasks, but intervention in the form of manual or automated regulation should happen post-workflow-generation and pre-robotic-execution. To address this, we propose a roadmap that can lead to fully automated and regulated robotic systems. In this paradigm, the high level policies are generated as structured graph data, enabling regulatory oversight and reusability, while the code base for lower level tasks is generated by generative models. Our approach aims the transitioning from expert knowledge to regulated action, akin to the iterative processes of study, practice, scrutiny, and execution in human tasks. We identify the generative and deterministic processes in a design cycle, where generative processes serve as a text-based world simulator and the deterministic processes generate the executable system. We propose State Machine Seralization Language (SMSL) to be the conversion point between text simulator and executable workflow control. From there, we analyze the modules involved based on the current literature, and discuss human in the loop. As a roadmap, this work identifies the current possible implementation and future work. This work does not provide an implemented system but envisions to inspire the researchers working on the direction in the roadmap. We implement the SMSL and D-SFO paradigm that serve as the starting point of the roadmap.
comment: 17 pages, 9 figures
☆ GelLink: A Compact Multi-phalanx Finger with Vision-based Tactile Sensing and Proprioception ICRA 2024
Compared to fully-actuated robotic end-effectors, underactuated ones are generally more adaptive, robust, and cost-effective. However, state estimation for underactuated hands is usually more challenging. Vision-based tactile sensors, like Gelsight, can mitigate this issue by providing high-resolution tactile sensing and accurate proprioceptive sensing. As such, we present GelLink, a compact, underactuated, linkage-driven robotic finger with low-cost, high-resolution vision-based tactile sensing and proprioceptive sensing capabilities. In order to reduce the amount of embedded hardware, i.e. the cameras and motors, we optimize the linkage transmission with a planar linkage mechanism simulator and develop a planar reflection simulator to simplify the tactile sensing hardware. As a result, GelLink only requires one motor to actuate the three phalanges, and one camera to capture tactile signals along the entire finger. Overall, GelLink is a compact robotic finger that shows adaptability and robustness when performing grasping tasks. The integration of vision-based tactile sensors can significantly enhance the capabilities of underactuated fingers and potentially broaden their future usage.
comment: Supplement video: https://www.youtube.com/watch?v=hZwUpAig5C0 . 7 pages, 9 figures. ICRA 2024 (IEEE International Conference on Robotics and Automation)
☆ Learning to Change: Choreographing Mixed Traffic Through Lateral Control and Hierarchical Reinforcement Learning
The management of mixed traffic that consists of robot vehicles (RVs) and human-driven vehicles (HVs) at complex intersections presents a multifaceted challenge. Traditional signal controls often struggle to adapt to dynamic traffic conditions and heterogeneous vehicle types. Recent advancements have turned to strategies based on reinforcement learning (RL), leveraging its model-free nature, real-time operation, and generalizability over different scenarios. We introduce a hierarchical RL framework to manage mixed traffic through precise longitudinal and lateral control of RVs. Our proposed hierarchical framework combines the state-of-the-art mixed traffic control algorithm as a high level decision maker to improve the performance and robustness of the whole system. Our experiments demonstrate that the framework can reduce the average waiting time by up to 54% compared to the state-of-the-art mixed traffic control method. When the RV penetration rate exceeds 60%, our technique consistently outperforms conventional traffic signal control programs in terms of the average waiting time for all vehicles at the intersection.
☆ TEeVTOL: Balancing Energy and Time Efficiency in eVTOL Aircraft Path Planning Across City-Scale Wind Fields
Electric vertical-takeoff and landing (eVTOL) aircraft, recognized for their maneuverability and flexibility, offer a promising alternative to our transportation system. However, the operational effectiveness of these aircraft faces many challenges, such as the delicate balance between energy and time efficiency, stemming from unpredictable environmental factors, including wind fields. Mathematical modeling-based approaches have been adopted to plan aircraft flight path in urban wind fields with the goal to save energy and time costs. While effective, they are limited in adapting to dynamic and complex environments. To optimize energy and time efficiency in eVTOL's flight through dynamic wind fields, we introduce a novel path planning method leveraging deep reinforcement learning. We assess our method with extensive experiments, comparing it to Dijkstra's algorithm -- the theoretically optimal approach for determining shortest paths in a weighted graph, where weights represent either energy or time cost. The results show that our method achieves a graceful balance between energy and time efficiency, closely resembling the theoretically optimal values for both objectives.
☆ Learning Quadruped Locomotion Using Differentiable Simulation
While most recent advancements in legged robot control have been driven by model-free reinforcement learning, we explore the potential of differentiable simulation. Differentiable simulation promises faster convergence and more stable training by computing low-variant first-order gradients using the robot model, but so far, its use for legged robot control has remained limited to simulation. The main challenge with differentiable simulation lies in the complex optimization landscape of robotic tasks due to discontinuities in contact-rich environments, e.g., quadruped locomotion. This work proposes a new, differentiable simulation framework to overcome these challenges. The key idea involves decoupling the complex whole-body simulation, which may exhibit discontinuities due to contact, into two separate continuous domains. Subsequently, we align the robot state resulting from the simplified model with a more precise, non-differentiable simulator to maintain sufficient simulation accuracy. Our framework enables learning quadruped walking in minutes using a single simulated robot without any parallelization. When augmented with GPU parallelization, our approach allows the quadruped robot to master diverse locomotion skills, including trot, pace, bound, and gallop, on challenging terrains in minutes. Additionally, our policy achieves robust locomotion performance in the real world zero-shot. To the best of our knowledge, this work represents the first demonstration of using differentiable simulation for controlling a real quadruped robot. This work provides several important insights into using differentiable simulations for legged locomotion in the real world.
☆ Multi-agent Task-Driven Exploration via Intelligent Map Compression and Sharing
This paper investigates the task-driven exploration of unknown environments with mobile sensors communicating compressed measurements. The sensors explore the area and transmit their compressed data to another robot, assisting it in reaching a goal location. We propose a novel communication framework and a tractable multi-agent exploration algorithm to select the sensors' actions. The algorithm uses a task-driven measure of uncertainty, resulting from map compression, as a reward function. We validate the efficacy of our algorithm through numerical simulations conducted on a realistic map and compare it with two alternative approaches. The results indicate that the proposed algorithm effectively decreases the time required for the robot to reach its target without causing excessive load on the communication network.
☆ Multiple and Gyro-Free Inertial Datasets
An inertial navigation system (INS) utilizes three orthogonal accelerometers and gyroscopes to determine platform position, velocity, and orientation. There are countless applications for INS, including robotics, autonomous platforms, and the internet of things. Recent research explores the integration of data-driven methods with INS, highlighting significant innovations, improving accuracy and efficiency. Despite the growing interest in this field and the availability of INS datasets, no datasets are available for gyro-free INS (GFINS) and multiple inertial measurement unit (MIMU) architectures. To fill this gap and to stimulate further research in this field, we designed and recorded GFINS and MIMU datasets using 54 inertial sensors grouped in nine inertial measurement units. These sensors can be used to define and evaluate different types of MIMU and GFINS architectures. The inertial sensors were arranged in three different sensor configurations and mounted on a mobile robot and a passenger car. In total, the dataset contains 35 hours of inertial data and corresponding ground truth trajectories. The data and code are freely accessible through our GitHub repository.
comment: 10 pages, 16 figures, 6 tables
♻ ☆ TD-MPC2: Scalable, Robust World Models for Continuous Control ICLR 2024
TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. Explore videos, models, data, code, and more at https://tdmpc2.com
comment: ICLR 2024. Explore videos, models, data, code, and more at https://tdmpc2.com
♻ ☆ A Modular Aerial System Based on Homogeneous Quadrotors with Fault-Tolerant Control ICRA2024
The standard quadrotor is one of the most popular and widely used aerial vehicle of recent decades, offering great maneuverability with mechanical simplicity. However, the under-actuation characteristic limits its applications, especially when it comes to generating desired wrench with six degrees of freedom (DOF). Therefore, existing work often compromises between mechanical complexity and the controllable DOF of the aerial system. To take advantage of the mechanical simplicity of a standard quadrotor, we propose a modular aerial system, IdentiQuad, that combines only homogeneous quadrotor-based modules. Each IdentiQuad can be operated alone like a standard quadrotor, but at the same time allows task-specific assembly, increasing the controllable DOF of the system. Each module is interchangeable within its assembly. We also propose a general controller for different configurations of assemblies, capable of tolerating rotor failures and balancing the energy consumption of each module. The functionality and robustness of the system and its controller are validated using physics-based simulations for different assembly configurations.
comment: ICRA2024
♻ ☆ Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation
As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to navigate to a specified object depicted by a goal image in an unexplored environment. The main challenge of this task lies in identifying the target object from different viewpoints while rejecting similar distractors. Existing ImageGoal Navigation methods usually adopt the simple Exploration-Exploitation framework and ignore the identification of specific instance during navigation. In this work, we propose to imitate the human behaviour of ``getting closer to confirm" when distinguishing objects from a distance. Specifically, we design a new modular navigation framework named Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level image goal navigation. Our method allows for active switching among the exploration, verification, and exploitation actions, thereby facilitating the agent in making reasonable decisions under different situations. On the challenging HabitatMatterport 3D semantic (HM3D-SEM) dataset, our method surpasses previous state-of-the-art work, with a classical segmentation model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 0.561 success). Our code will be made publicly available at https://github.com/XiaohanLei/IEVE.
♻ ☆ Learning a Depth Covariance Function CVPR 2023
We propose learning a depth covariance function with applications to geometric vision tasks. Given RGB images as input, the covariance function can be flexibly used to define priors over depth functions, predictive distributions given observations, and methods for active point selection. We leverage these techniques for a selection of downstream tasks: depth completion, bundle adjustment, and monocular dense visual odometry.
comment: CVPR 2023. Project page: https://edexheim.github.io/DepthCov/
♻ ☆ Deep learning reduces sensor requirements for gust rejection on a small uncrewed aerial vehicle morphing wing
There is a growing need for uncrewed aerial vehicles (UAVs) to operate in cities. However, the uneven urban landscape and complex street systems cause large-scale wind gusts that challenge the safe and effective operation of UAVs. Current gust alleviation methods rely on traditional control surfaces and computationally expensive modeling to select a control action, leading to a slower response. Here, we used deep reinforcement learning to create an autonomous gust alleviation controller for a camber-morphing wing. This method reduced gust impact by 84%, directly from real-time, on-board pressure signals. Notably, we found that gust alleviation using signals from only three pressure taps was statistically indistinguishable from using six signals. This reduced-sensor fly-by-feel control opens the door to UAV missions in previously inoperable locations.
♻ ☆ Exploring Human's Gender Perception and Bias toward Non-Humanoid Robots
As non-humanoid robots increasingly permeate various sectors, understanding their design implications for human acceptance becomes paramount. Despite their ubiquity, studies on how to improve human interaction are sparse. Our investigation, conducted through two surveys, addresses this gap. The first survey emphasizes non-humanoid robots and human perceptions about gender attributions, suggesting that both design and perceived gender influence acceptance. Survey 2 investigates the effects of varying gender cues on robot designs and their consequent impacts on human-robot interactions. Our findings highlighted that distinct gender cues can bolster or impede interaction comfort.
♻ ☆ Star-Searcher: A Complete and Efficient Aerial System for Autonomous Target Search in Complex Unknown Environments
This paper tackles the challenge of autonomous target search using unmanned aerial vehicles (UAVs) in complex unknown environments. To fill the gap in systematic approaches for this task, we introduce Star-Searcher, an aerial system featuring specialized sensor suites, mapping, and planning modules to optimize searching. Path planning challenges due to increased inspection requirements are addressed through a hierarchical planner with a visibility-based viewpoint clustering method. This simplifies planning by breaking it into global and local sub-problems, ensuring efficient global and local path coverage in real-time. Furthermore, our global path planning employs a history-aware mechanism to reduce motion inconsistency from frequent map changes, significantly enhancing search efficiency. We conduct comparisons with state-of-the-art methods in both simulation and the real world, demonstrating shorter flight paths, reduced time, and higher target search completeness. Our approach will be open-sourced for community benefit at https://github.com/SYSU-STAR/STAR-Searcher.
comment: Aceepted to IEEE RA-L. Code: https://github.com/SYSU-STAR/STAR-Searcher. Video: https://www.youtube.com/watch?v=08ll_oo_DtU
♻ ☆ Large Language Models for Multi-Modal Human-Robot Interaction
This paper presents an innovative large language model (LLM)-based robotic system for enhancing multi-modal human-robot interaction (HRI). Traditional HRI systems relied on complex designs for intent estimation, reasoning, and behavior generation, which were resource-intensive. In contrast, our system empowers researchers and practitioners to regulate robot behavior through three key aspects: providing high-level linguistic guidance, creating "atomics" for actions and expressions the robot can use, and offering a set of examples. Implemented on a physical robot, it demonstrates proficiency in adapting to multi-modal inputs and determining the appropriate manner of action to assist humans with its arms, following researchers' defined guidelines. Simultaneously, it coordinates the robot's lid, neck, and ear movements with speech output to produce dynamic, multi-modal expressions. This showcases the system's potential to revolutionize HRI by shifting from conventional, manual state-and-flow design methods to an intuitive, guidance-based, and example-driven approach.
comment: 10 pages, 6 figures
♻ ☆ MAkEable: Memory-centered and Affordance-based Task Execution Framework for Transferable Mobile Manipulation Skills
To perform versatile mobile manipulation tasks in human-centered environments, the ability to efficiently transfer learned tasks and experiences from one robot to another or across different environments is key. In this paper, we present MAkEable, a versatile uni- and multi-manual mobile manipulation framework that facilitates the transfer of capabilities and knowledge across different tasks, environments, and robots. Our framework integrates an affordance-based task description into the memory-centric cognitive architecture of the ARMAR humanoid robot family, which supports the sharing of experiences and demonstrations for transfer learning. By representing mobile manipulation actions through affordances, i.e., interaction possibilities of the robot with its environment, we provide a unifying framework for the autonomous uni- and multi-manual manipulation of known and unknown objects in various environments. We demonstrate the applicability of the framework in real-world experiments for multiple robots, tasks, and environments. This includes grasping known and unknown objects, object placing, bimanual object grasping, memory-enabled skill transfer in a drawer opening scenario across two different humanoid robots, and a pouring task learned from human demonstration.
♻ ☆ Driving Animatronic Robot Facial Expression From Speech
Animatronic robots aim to enable natural human-robot interaction through lifelike facial expressions. However, generating realistic, speech-synchronized robot expressions is challenging due to the complexities of facial biomechanics and responsive motion synthesis. This paper presents a principled, skinning-centric approach to drive animatronic robot facial expressions from speech. The proposed approach employs linear blend skinning (LBS) as the core representation to guide tightly integrated innovations in embodiment design and motion synthesis. LBS informs the actuation topology, enables human expression retargeting, and allows speech-driven facial motion generation. The proposed approach is capable of generating highly realistic, real-time facial expressions from speech on an animatronic face, significantly advancing robots' ability to replicate nuanced human expressions for natural interaction.
comment: Under review. For associated project page, see https://library87.github.io/animatronic-face-iros24
♻ ☆ ROS-Causal: A ROS-based Causal Analysis Framework for Human-Robot Interaction Applications
Deploying robots in human-shared spaces requires understanding interactions among nearby agents and objects. Modelling cause-and-effect relations through causal inference aids in predicting human behaviours and anticipating robot interventions. However, a critical challenge arises as existing causal discovery methods currently lack an implementation inside the ROS ecosystem, the standard de facto in robotics, hindering effective utilisation in robotics. To address this gap, this paper introduces ROS-Causal, a ROS-based framework for onboard data collection and causal discovery in human-robot spatial interactions. An ad-hoc simulator, integrated with ROS, illustrates the approach's effectiveness, showcasing the robot onboard generation of causal models during data collection. ROS-Causal is available on GitHub: https://github.com/lcastri/roscausal.git.
comment: Accepted by the "Causal-HRI: Causal Learning for Human-Robot Interaction" workshop at the 2024 ACM/IEEE International Conference on Human-Robot Interaction (HRI)
♻ ☆ Exploring Large Language Models to Facilitate Variable Autonomy for Human-Robot Teaming
In a rapidly evolving digital landscape autonomous tools and robots are becoming commonplace. Recognizing the significance of this development, this paper explores the integration of Large Language Models (LLMs) like Generative pre-trained transformer (GPT) into human-robot teaming environments to facilitate variable autonomy through the means of verbal human-robot communication. In this paper, we introduce a novel framework for such a GPT-powered multi-robot testbed environment, based on a Unity Virtual Reality (VR) setting. This system allows users to interact with robot agents through natural language, each powered by individual GPT cores. By means of OpenAI's function calling, we bridge the gap between unstructured natural language input and structure robot actions. A user study with 12 participants explores the effectiveness of GPT-4 and, more importantly, user strategies when being given the opportunity to converse in natural language within a multi-robot environment. Our findings suggest that users may have preconceived expectations on how to converse with robots and seldom try to explore the actual language and cognitive capabilities of their robot collaborators. Still, those users who did explore where able to benefit from a much more natural flow of communication and human-like back-and-forth. We provide a set of lessons learned for future research and technical implementations of similar systems.
comment: Frontiers in Robotics and AI, Variable Autonomy for Human-Robot Teaming
♻ ☆ SLIM: Skill Learning with Multiple Critics ICRA 2024
Self-supervised skill learning aims to acquire useful behaviors that leverage the underlying dynamics of the environment. Latent variable models, based on mutual information maximization, have been successful in this task but still struggle in the context of robotic manipulation. As it requires impacting a possibly large set of degrees of freedom composing the environment, mutual information maximization fails alone in producing useful and safe manipulation behaviors. Furthermore, tackling this by augmenting skill discovery rewards with additional rewards through a naive combination might fail to produce desired behaviors. To address this limitation, we introduce SLIM, a multi-critic learning approach for skill discovery with a particular focus on robotic manipulation. Our main insight is that utilizing multiple critics in an actor-critic framework to gracefully combine multiple reward functions leads to a significant improvement in latent-variable skill discovery for robotic manipulation while overcoming possible interference occurring among rewards which hinders convergence to useful skills. Furthermore, in the context of tabletop manipulation, we demonstrate the applicability of our novel skill discovery approach to acquire safe and efficient motor primitives in a hierarchical reinforcement learning fashion and leverage them through planning, significantly surpassing baseline approaches for skill discovery.
comment: Accepted at IEEE ICRA 2024
♻ ☆ R2SNet: Scalable Domain Adaptation for Object Detection in Cloud-Based Robots Ecosystems via Proposal Refinement
We introduce a novel approach for scalable domain adaptation in cloud robotics scenarios where robots rely on third-party AI inference services powered by large pre-trained deep neural networks. Our method is based on a downstream proposal-refinement stage running locally on the robots, exploiting a new lightweight DNN architecture, R2SNet. This architecture aims to mitigate performance degradation from domain shifts by adapting the object detection process to the target environment, focusing on relabeling, rescoring, and suppression of bounding-box proposals. Our method allows for local execution on robots, addressing the scalability challenges of domain adaptation without incurring significant computational costs. Real-world results on mobile service robots performing door detection show the effectiveness of the proposed method in achieving scalable domain adaptation.
♻ ☆ CoBRA: A Composable Benchmark for Robotics Applications ICRA'24
Selecting an optimal robot, its base pose, and trajectory for a given task is currently mainly done by human expertise or trial and error. To evaluate automatic approaches to this combined optimization problem, we introduce a benchmark suite encompassing a unified format for robots, environments, and task descriptions. Our benchmark suite is especially useful for modular robots, where the multitude of robots that can be assembled creates a host of additional parameters to optimize. We include tasks such as machine tending and welding in synthetic environments and 3D scans of real-world machine shops. All benchmarks are accessible through https://cobra.cps.cit.tum.de, a platform to conveniently share, reference, and compare tasks, robot models, and solutions.
comment: 7 pages, 5 Figures, 5 Tables Final version for IEEE ICRA'24
♻ ☆ Generalized Early Stopping in Evolutionary Direct Policy Search
Lengthy evaluation times are common in many optimization problems such as direct policy search tasks, especially when they involve conducting evaluations in the physical world, e.g. in robotics applications. Often when evaluating solution over a fixed time period it becomes clear that the objective value will not increase with additional computation time (for example when a two wheeled robot continuously spins on the spot). In such cases, it makes sense to stop the evaluation early to save computation time. However, most approaches to stop the evaluation are problem specific and need to be specifically designed for the task at hand. Therefore, we propose an early stopping method for direct policy search. The proposed method only looks at the objective value at each time step and requires no problem specific knowledge. We test the introduced stopping criterion in five direct policy search environments drawn from games, robotics and classic control domains, and show that it can save up to 75% of the computation time. We also compare it with problem specific stopping criteria and show that it performs comparably, while being more generally applicable.
♻ ☆ Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework
The socially-aware navigation system has evolved to adeptly avoid various obstacles while performing multiple tasks, such as point-to-point navigation, human-following, and -guiding. However, a prominent gap persists: in Human-Robot Interaction (HRI), the procedure of communicating commands to robots demands intricate mathematical formulations. Furthermore, the transition between tasks does not quite possess the intuitive control and user-centric interactivity that one would desire. In this work, we propose an LLM-driven interactive multimodal multitask robot navigation framework, termed LIM2N, to solve the above new challenge in the navigation field. We achieve this by first introducing a multimodal interaction framework where language and hand-drawn inputs can serve as navigation constraints and control objectives. Next, a reinforcement learning agent is built to handle multiple tasks with the received information. Crucially, LIM2N creates smooth cooperation among the reasoning of multimodal input, multitask planning, and adaptation and processing of the intelligent sensing modules in the complicated system. Extensive experiments are conducted in both simulation and the real world demonstrating that LIM2N has superior user needs understanding, alongside an enhanced interactive experience.
♻ ☆ Real-time Perceptive Motion Control using Control Barrier Functions with Analytical Smoothing for Six-Wheeled-Telescopic-Legged Robot Tachyon 3
To achieve safe legged locomotion, it is important to generate motion in real-time considering various constraints in robots and environments. In this study, we propose a lightweight real-time perspective motion control system for the newly developed six-wheeled-telescopic-legged robot, Tachyon 3. In the proposed method, analytically smoothed constraints including Smooth Separating Axis Theorem (Smooth SAT) as a novel higher order differentiable collision detection for 3D shapes is applied to the Control Barrier Function (CBF). The proposed system integrating the CBF achieves online motion generation in a short control cycle of 1 ms that satisfies joint limitations, environmental collision avoidance and safe convex foothold constraints. The efficiency of Smooth SAT is shown from the collision detection time of 1 us or less and the CBF constraint computation time for Tachyon3 of several us. Furthermore, the effectiveness of the proposed system is verified through the stair-climbing motion, integrating online recognition in a simulation and a real machine.
comment: 8 pages, 8 figures, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Distilling and Retrieving Generalizable Knowledge for Robot Manipulation via Language Corrections
Today's robot policies exhibit subpar performance when faced with the challenge of generalizing to novel environments. Human corrective feedback is a crucial form of guidance to enable such generalization. However, adapting to and learning from online human corrections is a non-trivial endeavor: not only do robots need to remember human feedback over time to retrieve the right information in new settings and reduce the intervention rate, but also they would need to be able to respond to feedback that can be arbitrary corrections about high-level human preferences to low-level adjustments to skill parameters. In this work, we present Distillation and Retrieval of Online Corrections (DROC), a large language model (LLM)-based system that can respond to arbitrary forms of language feedback, distill generalizable knowledge from corrections, and retrieve relevant past experiences based on textual and visual similarity for improving performance in novel settings. DROC is able to respond to a sequence of online language corrections that address failures in both high-level task plans and low-level skill primitives. We demonstrate that DROC effectively distills the relevant information from the sequence of online corrections in a knowledge base and retrieves that knowledge in settings with new task or object instances. DROC outperforms other techniques that directly generate robot code via LLMs by using only half of the total number of corrections needed in the first round and requires little to no corrections after two iterations. We show further results, videos, prompts and code on https://sites.google.com/stanford.edu/droc .
comment: 8 pages, 4 figures, videos and code links on website https://sites.google.com/stanford.edu/droc
♻ ☆ Deep Learning for Inertial Positioning: A Survey
Inertial sensors are widely utilized in smartphones, drones, robots, and IoT devices, playing a crucial role in enabling ubiquitous and reliable localization. Inertial sensor-based positioning is essential in various applications, including personal navigation, location-based security, and human-device interaction. However, low-cost MEMS inertial sensors' measurements are inevitably corrupted by various error sources, leading to unbounded drifts when integrated doubly in traditional inertial navigation algorithms, subjecting inertial positioning to the problem of error drifts. In recent years, with the rapid increase in sensor data and computational power, deep learning techniques have been developed, sparking significant research into addressing the problem of inertial positioning. Relevant literature in this field spans across mobile computing, robotics, and machine learning. In this article, we provide a comprehensive review of deep learning-based inertial positioning and its applications in tracking pedestrians, drones, vehicles, and robots. We connect efforts from different fields and discuss how deep learning can be applied to address issues such as sensor calibration, positioning error drift reduction, and multi-sensor fusion. This article aims to attract readers from various backgrounds, including researchers and practitioners interested in the potential of deep learning-based techniques to solve inertial positioning problems. Our review demonstrates the exciting possibilities that deep learning brings to the table and provides a roadmap for future research in this field.
comment: Accepted by IEEE Transactions on Intelligent Transportation Systems
♻ ☆ AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving
As an essential task in autonomous driving (AD), motion prediction aims to predict the future states of surround objects for navigation. One natural solution is to estimate the position of other agents in a step-by-step manner where each predicted time-step is conditioned on both observed time-steps and previously predicted time-steps, i.e., autoregressive prediction. Pioneering works like SocialLSTM and MFP design their decoders based on this intuition. However, almost all state-of-the-art works assume that all predicted time-steps are independent conditioned on observed time-steps, where they use a single linear layer to generate positions of all time-steps simultaneously. They dominate most motion prediction leaderboards due to the simplicity of training MLPs compared to autoregressive networks. In this paper, we introduce the GPT style next token prediction into motion forecasting. In this way, the input and output could be represented in a unified space and thus the autoregressive prediction becomes more feasible. However, different from language data which is composed of homogeneous units -words, the elements in the driving scene could have complex spatial-temporal and semantic relations. To this end, we propose to adopt three factorized attention modules with different neighbors for information aggregation and different position encoding styles to capture their relations, e.g., encoding the transformation between coordinate systems for spatial relativity while adopting RoPE for temporal relativity. Empirically, by equipping with the aforementioned tailored designs, the proposed method achieves state-of-the-art performance in the Waymo Open Motion and Waymo Interaction datasets. Notably, AMP outperforms other recent autoregressive motion prediction methods: MotionLM and StateTransformer, which demonstrates the effectiveness of the proposed designs.
♻ ☆ Redundancy parameterization and inverse kinematics of 7-DOF revolute manipulators
Seven degree-of-freedom (DOF) robot arms have one redundant DOF which does not change the motion of the end effector. The redundant DOF offers greater manipulability of the arm configuration to avoid obstacles and singularities, but it must be parameterized to fully specify the joint angles for a given end effector pose. For 7-DOF revolute (7R) manipulators, we introduce a new concept of generalized shoulder-elbow-wrist (SEW) angle, a generalization of the conventional SEW angle but with an arbitrary choice of the reference direction function. The SEW angle is widely used and easy for human operators to visualize as a rotation of the elbow about the shoulder-wrist line. Since other redundancy parameterizations including the conventional SEW angle encounter an algorithmic singularity along a line in the workspace, we introduce a special choice of the reference direction function called the stereographic SEW angle which has a singularity only along a half-line, which can be placed out of reach. We prove that such a singularity is unavoidable for any parameterization. We also include expressions for the SEW angle Jacobian along with singularity analysis. Finally, we provide efficient and singularity-robust inverse kinematics solutions for most known 7R manipulators using the general SEW angle and the subproblem decomposition method. These solutions are often closed-form but may sometimes involve a 1D or 2D search in the general case. Search-based solutions may be converted to finding zeros of a high-order polynomial. Inverse kinematics solutions, examples, and evaluations are available in a publicly accessible repository.
comment: 22 pages, 14 figures. Update: Sawyer IK using polynomial method, two video extensions, expanded related literature
♻ ☆ Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.
comment: Project website: https://robotics-transformer-x.github.io
♻ ☆ TeleMoMa: A Modular and Versatile Teleoperation System for Mobile Manipulation
A critical bottleneck limiting imitation learning in robotics is the lack of data. This problem is more severe in mobile manipulation, where collecting demonstrations is harder than in stationary manipulation due to the lack of available and easy-to-use teleoperation interfaces. In this work, we demonstrate TeleMoMa, a general and modular interface for whole-body teleoperation of mobile manipulators. TeleMoMa unifies multiple human interfaces including RGB and depth cameras, virtual reality controllers, keyboard, joysticks, etc., and any combination thereof. In its more accessible version, TeleMoMa works using simply vision (e.g., an RGB-D camera), lowering the entry bar for humans to provide mobile manipulation demonstrations. We demonstrate the versatility of TeleMoMa by teleoperating several existing mobile manipulators - PAL Tiago++, Toyota HSR, and Fetch - in simulation and the real world. We demonstrate the quality of the demonstrations collected with TeleMoMa by training imitation learning policies for mobile manipulation tasks involving synchronized whole-body motion. Finally, we also show that TeleMoMa's teleoperation channel enables teleoperation on site, looking at the robot, or remote, sending commands and observations through a computer network, and perform user studies to evaluate how easy it is for novice users to learn to collect demonstrations with different combinations of human interfaces enabled by our system. We hope TeleMoMa becomes a helpful tool for the community enabling researchers to collect whole-body mobile manipulation demonstrations. For more information and video results, https://robin-lab.cs.utexas.edu/telemoma-web.
♻ ☆ OASIS: Optimal Arrangements for Sensing in SLAM
The number and arrangement of sensors on mobile robot dramatically influence its perception capabilities. Ensuring that sensors are mounted in a manner that enables accurate detection, localization, and mapping is essential for the success of downstream control tasks. However, when designing a new robotic platform, researchers and practitioners alike usually mimic standard configurations or maximize simple heuristics like field-of-view (FOV) coverage to decide where to place exteroceptive sensors. In this work, we conduct an information-theoretic investigation of this overlooked element of robotic perception in the context of simultaneous localization and mapping (SLAM). We show how to formalize the sensor arrangement problem as a form of subset selection under the E-optimality performance criterion. While this formulation is NP-hard in general, we show that a combination of greedy sensor selection and fast convex relaxation-based post-hoc verification enables the efficient recovery of certifiably optimal sensor designs in practice. Results from synthetic experiments reveal that sensors placed with OASIS outperform benchmarks in terms of mean squared error of visual SLAM estimates.
Systems and Control
☆ SDP Synthesis of Maximum Coverage Trees for Probabilistic Planning under Control Constraints
The paper presents Maximal Covariance Backward Reachable Trees (MAXCOVAR BRT), which is a multi-query algorithm for planning of dynamic systems under stochastic motion uncertainty and constraints on the control input with explicit coverage guarantees. In contrast to existing roadmap-based probabilistic planning methods that sample belief nodes randomly and draw edges between them \cite{csbrm_tro2024}, under control constraints, the reachability of belief nodes needs to be explicitly established and is determined by checking the feasibility of a non-convex program. Moreover, there is no explicit consideration of coverage of the roadmap while adding nodes and edges during the construction procedure for the existing methods. Our contribution is a novel optimization formulation to add nodes and construct the corresponding edge controllers such that the generated roadmap results in provably maximal coverage under control constraints as compared to any other method of adding nodes and edges. We characterize formally the notion of coverage of a roadmap in this stochastic domain via introduction of the h-$\operatorname{BRS}$ (Backward Reachable Set of Distributions) of a tree of distributions under control constraints, and also support our method with extensive simulations on a 6 DoF model.
☆ Learning Hierarchical Control For Constrained Dynamic Task Assignment
This paper introduces a novel data-driven hierarchical control scheme for managing a fleet of nonlinear, capacity-constrained autonomous agents in an iterative environment. We propose a control framework consisting of a high-level dynamic task assignment and routing layer and low-level motion planning and tracking layer. Each layer of the control hierarchy uses a data-driven MPC policy, maintaining bounded computational complexity at each calculation of a new task assignment or actuation input. We utilize collected data to iteratively refine estimates of agent capacity usage, and update MPC policy parameters accordingly. Our approach leverages tools from iterative learning control to integrate learning at both levels of the hierarchy, and coordinates learning between levels in order to maintain closed-loop feasibility and performance improvement of the connected architecture.
☆ Learning Hierarchical Control Systems for Autonomous Systems with Energy Constraints
This paper focuses on the design of hierarchical control architectures for autonomous systems with energy constraints. We focus on systems where energy storage limitations and slow recharge rates drastically affect the way the autonomous systems are operated. Using examples from space robotics and public transportation, we motivate the need for formally designed learning hierarchical control systems. We propose a learning control architecture which incorporates learning mechanisms at various levels of the control hierarchy to improve performance and resource utilization. The proposed hierarchical control scheme relies on high-level energy-aware task planning and assignment, complemented by a low-level predictive control mechanism responsible for the autonomous execution of tasks, including motion control and energy management. Simulation examples show the benefits and the limitations of the proposed architecture when learning is used to obtain a more energy-efficient task allocation.
☆ Optimizing queues with deadlines under infrequent monitoring
In this paper, we aim to improve the percentage of packets meeting their deadline in discrete-time M/M/1 queues with infrequent monitoring. More specifically, we look into policies that only monitor the system (and subsequently take actions) after a packet arrival. We model the system as an MDP and provide the optimal policy for some special cases. Furthermore, we introduce a heuristic algorithm called "AB-n" for general deadlines. Finally, we provide numerical results demonstrating the desirable performance of "AB-n" policies.
comment: 11 pages, 6 figures
☆ Designing Robust Linear Output Feedback Controller based on CLF-CBF framework via Linear~Programming(LP-CLF-CBF)
We consider the problem of designing output feedback controllers that use measurements from a set of landmarks to navigate through a cell-decomposable environment using duality, Control Lyapunov and Barrier Functions (CLF, CBF), and Linear Programming. We propose two objectives for navigating in an environment, one to traverse the environment by making loops and one by converging to a stabilization point while smoothing the transition between consecutive cells. We test our algorithms in a simulation environment, evaluating the robustness of the approach to practical conditions, such as bearing-only measurements, and measurements acquired with a camera with a limited field of view.
comment: arXiv admin note: text overlap with arXiv:2203.04416
☆ Constrained Reinforcement Learning with Smoothed Log Barrier Function
Reinforcement Learning (RL) has been widely applied to many control tasks and substantially improved the performances compared to conventional control methods in many domains where the reward function is well defined. However, for many real-world problems, it is often more convenient to formulate optimization problems in terms of rewards and constraints simultaneously. Optimizing such constrained problems via reward shaping can be difficult as it requires tedious manual tuning of reward functions with several interacting terms. Recent formulations which include constraints mostly require a pre-training phase, which often needs human expertise to collect data or assumes having a sub-optimal policy readily available. We propose a new constrained RL method called CSAC-LB (Constrained Soft Actor-Critic with Log Barrier Function), which achieves competitive performance without any pre-training by applying a linear smoothed log barrier function to an additional safety critic. It implements an adaptive penalty for policy learning and alleviates the numerical issues that are known to complicate the application of the log barrier function method. As a result, we show that with CSAC-LB, we achieve state-of-the-art performance on several constrained control tasks with different levels of difficulty and evaluate our methods in a locomotion task on a real quadruped robot platform.
☆ Meta-learning of data-driven controllers with automatic model reference tuning: theory and experimental case study
Data-driven control offers a viable option for control scenarios where constructing a system model is expensive or time-consuming. Nonetheless, many of these algorithms are not entirely automated, often necessitating the adjustment of multiple hyperparameters through cumbersome trial-and-error processes and demanding significant amounts of data. In this paper, we explore a meta-learning approach to leverage potentially existing prior knowledge about analogous (though not identical) systems, aiming to reduce both the experimental workload and ease the tuning of the available degrees of freedom. We validate this methodology through an experimental case study involving the tuning of proportional, integral (PI) controllers for brushless DC (BLDC) motors with variable loads and architectures.
☆ Synthesizing Controller for Safe Navigation using Control Density Function
We consider the problem of navigating a nonlinear dynamical system from some initial set to some target set while avoiding collision with an unsafe set. We extend the concept of density function to control density function (CDF) for solving navigation problems with safety constraints. The occupancy-based interpretation of the measure associated with the density function is instrumental in imposing the safety constraints. The navigation problem with safety constraints is formulated as a quadratic program (QP) using CDF. The existing approach using the control barrier function (CBF) also formulates the navigation problem with safety constraints as QP. One of the main advantages of the proposed QP using CDF compared to QP formulated using CBF is that both the convergence/stability and safety can be combined and imposed using the CDF. Simulation results involving the Duffing oscillator and safe navigation of Dubin car models are provided to verify the main findings of the paper.
☆ On the continuity and smoothness of the value function in reinforcement learning and optimal control
The value function plays a crucial role as a measure for the cumulative future reward an agent receives in both reinforcement learning and optimal control. It is therefore of interest to study how similar the values of neighboring states are, i.e., to investigate the continuity of the value function. We do so by providing and verifying upper bounds on the value function's modulus of continuity. Additionally, we show that the value function is always H\"older continuous under relatively weak assumptions on the underlying system and that non-differentiable value functions can be made differentiable by slightly "disturbing" the system.
☆ Exploiting Over-The-Air Consensus for Collision Avoidance and Formation Control in Multi-Agent Systems
This paper introduces a distributed control method for multi-agent robotic systems employing Over the Air Consensus (OTA-Consensus). Designed for agents with decoupled single-integrator dynamics, this approach aims at efficient formation achievement and collision avoidance. As a distinctive feature, it leverages OTA's ability to exploit interference in wireless channels, a property traditionally considered a drawback, thus enhancing communication efficiency among robots. An analytical proof of asymptotic convergence is established for systems with time-varying communication topologies represented by sequences of strongly connected directed graphs. Comparative evaluations demonstrate significant efficiency improvements over current state-of-the-art methods, especially in scenarios with a large number of agents.
comment: Submitted to CDC 2024
☆ A new control-oriented METANET model to encompass service stations on highways
In this paper, we propose the METANET with service station (METANET-s) model, a second-order macroscopic traffic model that, compared to the classical METANET, incorporates the dynamics of service stations on highways. Specifically, we employ the (so-called) store-and-forward links to model the stop of vehicles and the possible queue forming in the process of merging back into the highway mainstream. We explore the capability of the METANET-s to capture well both traffic back propagation and capacity drops, which are typically caused by the presence of vehicles joining again the mainstream traffic from the service station. Therefore, capturing these effects is crucial to improving the model's predictive capabilities. Finally, we perform a comparative analysis with the Cell Transmission Model with service station (CTM-s), showcasing that the METANET-s describes the traffic evolution much better than its first-order counterpart.
comment: 7 pages, 4 figures, to be published in European Control Conference (ECC) 2024
☆ A Benchmark for the Application of Distributed Control Techniques to the Electricity Network of the European Economic Area
The European Economic Area Electricity Network Benchmark (EEA-ENB) is a multi-area power system representing the European network of transmission systems for electricity to facilitate the application of distributed control techniques. In the EEA-ENB we consider the Load Frequency Control (LFC) problem in the presence of renewable energy sources (RESs), and energy storage systems (ESSs). RESs are known to cause instability in power networks due to their inertia-less and intermittent characteristics, while ESSs are introduced as a resource to mitigate the problem. In the EEA-ENB, particular attention is dedicated to Distributed Model Predictive Control (DMPC), whose application is often limited to small and homogeneous test cases due to the lack of standardized large-scale scenarios for testing, and due to the large computation time required to obtain a centralized MPC action for performance comparison with DMPC strategies under consideration. The second problem is exacerbated when the scale of the system grows. To address these challenges and to provide a real-world-based and control-independent benchmark, the EEA-ENB has been developed. The benchmark includes a centralized MPC strategy providing performance and computation time metrics to compare distributed control within a repeatable and realistic simulation environment.
comment: 20 pages, 19 figures, 22 references, book chapter
☆ A Control Barrier Function Composition Approach for Multi-Agent Systems in Marine Applications
The agents within a multi-agent system (MAS) operating in marine environments often need to utilize task payloads and avoid collisions in coordination, necessitating adherence to a set of relative-pose constraints, which may include field-of-view, line-of-sight, collision-avoidance, and range constraints. A nominal controller designed for reference tracking may not guarantee the marine MAS stays safe w.r.t. these constraints. To modify the nominal input as one that enforces safety, we introduce a framework to systematically encode the relative-pose constraints as nonsmooth control barrier functions (NCBFs) and combine them as a single NCBF using Boolean composition, which enables a simplified verification process compared to using the NCBFs individually. While other relative-pose constraint functions have explicit derivatives, the challenging line-of-sight constraint is encoded with the minimum distance function between the line-of-sight set and other agents, whose derivative is not explicit. Hence, existing safe control design methods that consider composite NCBFs cannot be applied. To address this challenge, we propose a novel quadratic program formulation based on the dual of the minimum distance problem and develop a new theory to ensure the resulting control input guarantees constraint satisfaction. Lastly, we validate the effectiveness of our proposed framework on a simulated large-scale marine MAS and a real-world marine MAS comprising one Unmanned Surface Vehicle and two Unmanned Underwater Vehicles.
comment: 11 pages, 8 figures
☆ Transformation-Free Fixed-Structure Model Reduction for LPV Systems
In this paper, we propose a model reduction technique for linear parameter varying (LPV) systems based on available tools for fixed-structure controller synthesis. We start by transforming a model reduction problem into an equivalent controller synthesis problem by defining an appropriate generalized plant. The controller synthesis problem is then solved by using gradient-based tools available in the literature. Owing to the flexibility of the gradient-based synthesis tools, we are able to impose a desired structure on the obtained reduced model. Additionally, we obtain a bound on the approximation error as a direct output of the optimization problem. The proposed methods are applied on a benchmark mechanical system of interconnected masses, springs and dampers. To evaluate the effect of the proposed model-reduction approach on controller design, LPV controllers designed using the reduced models (with and without an imposed structure) are compared in closed-loop with the original model.
☆ Multi Methods of Matrix Analysis Use for Control and Optimization system in Control Engineering
Matrix analysis plays a crucial role in the field of control engineering, providing a powerful mathematical framework for the analysis and design of control systems. This research report explores various applications of matrix analysis in control engineering, focusing on its contributions to system modeling, stability analysis, controllablity, observability, and optimization. The report also discusses specific examples and case studies to illustrate the practical significance of matrix analysis in addressing real-world control engineering challenges Analyze controllability. Informally, a system is controllable if we can construct a set of inputs that will drive the system to any given state. Analyze observability. Informally, observability means that by controlling the inputs and watching the outputs of a system we can determine what the states were. Optimal Control is a control method that aims to find the optimal control input to achieve the best performance of the system under certain constraints. This performance index can be the system output, energy consumption, time, etc.
comment: 8 pages
☆ Event-triggered Boundary Control of Mixed-autonomy Traffic
Control problems of mixed-autonomy traffic system consisting of both Human-driven Vehicles (HV) and Autonomous Vehicles (AV) have gained increasing attention. This paper is focused on suppressing traffic oscillations of the mixed-autonomy traffic system using boundary control design. The mixed traffic dynamics are described by a 4 x 4 hyperbolic partial differential equations (PDE) which governs propagation of four properties in traffic including density of HV, density of AV, friction between two classes of vehicles from driving interactions, and averaged velocity. We propose event-triggered boundary control design since control signal of traffic light on ramp or varying speed limit cannot be updated in a continuous time fashion. We apply event-triggered mechanism for a PDE backstepping controller and obtain dynamic triggering condition. Lyapunov analysis is conducted to prove the exponential stability of the closed loop system with the event-triggered controller. Numerical simulation demonstrates how car-following spacing of AV affects event-triggering mechanism of control input in mixed-autonomy traffic.
☆ Conservative Linear Envelopes for High-Dimensional, Hamilton-Jacobi Reachability for Nonlinear Systems via the Hopf Formula
Hamilton-Jacobi reachability (HJR) analysis provides a value function that encodes (1) the set of states from which a nonlinear system with bounded control inputs can reach a goal (or avoid a failure set) despite any bounded disturbance, and (2) the corresponding optimal control policy to reach (or avoid). Though powerful, traditional methods for HJR rely on dynamic programming and suffer from exponential computation growth with respect to state dimension. The recently favored Hopf formula mitigates this ''curse of dimensionality'' by providing an efficient and space-parallelizable approach for solving the reachability problem. However, the Hopf formula can only be applied to linear time-varying systems. To overcome this limitation, we show that the error between a nonlinear system and a linear model can be transformed into an adversarial bounded artificial disturbance, making an envelope of the true value. One may then solve the dimension-robust Hopf formula for a linear game with this ''antagonistic error" to perform guaranteed conservative reachability analysis and control synthesis of nonlinear systems; this can be done for problem formulations in which no other HJR method is both computationally feasible and guaranteed. In addition, we offer several technical methods for reducing conservativeness in the analysis. We demonstrate the theory by solving the safe linear envelope in the controlled Van der Pol system, where the true reachable set may be observed, and by solving a 5 agent (15D) pursuit-evasion game with Dubins cars.
☆ Lane level joint control of off-ramp and main line speed guidance on expressway in rainy weather
In the upstream of the exit ramp of the expressway, the speed limit difference leads to a significant deceleration of the vehicle in the area adjacent to the off-ramp. The friction coefficient of the road surface decreases under rainy weather, and the above deceleration process can easily lead to sideslip and rollover of the vehicle. Dynamic speed guidance is an effective way to improve the status quo. Currently, there is an emerging trend to utilize I2V technology and high-precision map technology for lane level speed guidance control. This paper presents an optimized joint control strategy for main line-off-ramp speed guidance, which can adjust the guidance speed in real time according to the rainfall intensity. At the same time, this paper designs a progressive deceleration strategy, which works together with the speed guidance control to ensure the safe deceleration of vehicles. The simulation results show that the proposed control strategy outperforms the fixed speed limit control in terms of improving the total traveled time (TTT), total traveled distance (TTD) and standard deviation of speed (SD). Sensitivity analysis shows that the proposed control strategy can improve performance with the increase of the compliance rate of drivers. The speed guidance control method established in this paper can improve the vehicle operation efficiency in the off-ramp area of the expressway and reduce the speed difference of each vehicle in rainy weather, which guarantee the safety of expressway driving in the rainy day.
comment: 103rd TRB Conference
☆ LR-FHSS Transceiver for Direct-to-Satellite IoT Communications: Design, Implementation, and Verification
This paper proposes a long range-frequency hopping spread spectrum (LR-FHSS) transceiver design for the Direct-to-Satellite Internet of Things (DtS-IoT) communication system. The DtS-IoT system has recently attracted attention as a promising nonterrestrial network (NTN) solution to provide high-traffic and low-latency data transfer services to IoT devices in global coverage. In particular, this study provides guidelines for the overall DtS-IoT system architecture and design details that conform to the Long Range Wide-Area Network (LoRaWAN). Furthermore, we also detail various DtS-IoT use cases. Considering the multiple low-Earth orbit (LEO) satellites, we developed the LR-FHSS transceiver to improve system efficiency, which is the first attempt in real satellite communication systems using LR-FHSS. Moreover, as an extension of our previous work with perfect synchronization, we applied a robust synchronization scheme against the Doppler effect and co-channel interference (CCI) caused by LEO satellite channel environments, including signal detection for the simultaneous reception of numerous frequency hopping signals and an enhanced soft-output-Viterbi-algorithm (SOVA) for the header and payload receptions. Lastly, we present proof-of-concept implementation and testbeds using an application-specific integrated circuit (ASIC) chipset and a field-programmable gate array (FPGA) that verify the performance of the proposed LR-FHSS transceiver design of DtS-IoT communication systems. The laboratory test results reveal that the proposed LR-FHSS-based framework with the robust synchronization technique can provide wide coverage, seamless connectivity, and high throughput communication links for the realization of future sixth-generation (6G) networks.
comment: 17pages, 23 figures
☆ Carbon Footprint Reduction for Sustainable Data Centers in Real-Time
As machine learning workloads significantly increase energy consumption, sustainable data centers with low carbon emissions are becoming a top priority for governments and corporations worldwide. This requires a paradigm shift in optimizing power consumption in cooling and IT loads, shifting flexible loads based on the availability of renewable energy in the power grid, and leveraging battery storage from the uninterrupted power supply in data centers, using collaborative agents. The complex association between these optimization strategies and their dependencies on variable external factors like weather and the power grid carbon intensity makes this a hard problem. Currently, a real-time controller to optimize all these goals simultaneously in a dynamic real-world setting is lacking. We propose a Data Center Carbon Footprint Reduction (DC-CFR) multi-agent Reinforcement Learning (MARL) framework that optimizes data centers for the multiple objectives of carbon footprint reduction, energy consumption, and energy cost. The results show that the DC-CFR MARL agents effectively resolved the complex interdependencies in optimizing cooling, load shifting, and energy storage in real-time for various locations under real-world dynamic weather and grid carbon intensity conditions. DC-CFR significantly outperformed the industry standard ASHRAE controller with a considerable reduction in carbon emissions (14.5%), energy usage (14.4%), and energy cost (13.7%) when evaluated over one year across multiple geographical regions.
☆ PE-GPT: A Physics-Informed Interactive Large Language Model for Power Converter Modulation Design
This paper proposes PE-GPT, a custom-tailored large language model uniquely adapted for power converter modulation design. By harnessing in-context learning and specialized tiered physics-informed neural networks, PE-GPT guides users through text-based dialogues, recommending actionable modulation parameters. The effectiveness of PE-GPT is validated through a practical design case involving dual active bridge converters, supported by hardware experimentation. This research underscores the transformative potential of large language models in power converter modulation design, offering enhanced accessibility, explainability, and efficiency, thereby setting a new paradigm in the field.
☆ Joint Planning of Charging Stations and Power Systems for Heavy-Duty Drayage Trucks
As global concerns about climate change intensify, the transition towards zero-emission freight is becoming increasingly vital. Drayage is an important segment of the freight system, typically involving the transport of goods from seaports or intermodal terminals to nearby warehouses. This sector significantly contributes to not only greenhouse gas emissions, but also pollution in densely populated areas. This study presents a holistic optimization model designed for an efficient transition to zero-emission drayage, offering cost-effective strategies for the coordinated investment planning for power systems, charging infrastructure, and electric drayage trucks. The model is validated in the Greater Los Angeles area, where regulatory goals are among the most ambitious. Furthermore, the model's design allows for easy adaptation to other regions. By focusing on drayage trucks, this study also paves the way for future research into other freight categories, establishing a foundation for a more extensive exploration in this field.
comment: 34 pages, 10 figures
☆ Robust Model Based Reinforcement Learning Using $\mathcal{L}_1$ Adaptive Control
We introduce $\mathcal{L}_1$-MBRL, a control-theoretic augmentation scheme for Model-Based Reinforcement Learning (MBRL) algorithms. Unlike model-free approaches, MBRL algorithms learn a model of the transition function using data and use it to design a control input. Our approach generates a series of approximate control-affine models of the learned transition function according to the proposed switching law. Using the approximate model, control input produced by the underlying MBRL is perturbed by the $\mathcal{L}_1$ adaptive control, which is designed to enhance the robustness of the system against uncertainties. Importantly, this approach is agnostic to the choice of MBRL algorithm, enabling the use of the scheme with various MBRL algorithms. MBRL algorithms with $\mathcal{L}_1$ augmentation exhibit enhanced performance and sample efficiency across multiple MuJoCo environments, outperforming the original MBRL algorithms, both with and without system noise.
☆ Model order reduction of deep structured state-space models: A system-theoretic approach
With a specific emphasis on control design objectives, achieving accurate system modeling with limited complexity is crucial in parametric system identification. The recently introduced deep structured state-space models (SSM), which feature linear dynamical blocks as key constituent components, offer high predictive performance. However, the learned representations often suffer from excessively large model orders, which render them unsuitable for control design purposes. The current paper addresses this challenge by means of system-theoretic model order reduction techniques that target the linear dynamical blocks of SSMs. We introduce two regularization terms which can be incorporated into the training loss for improved model order reduction. In particular, we consider modal $\ell_1$ and Hankel nuclear norm regularization to promote sparsity, allowing one to retain only the relevant states without sacrificing accuracy. The presented regularizers lead to advantages in terms of parsimonious representations and faster inference resulting from the reduced order models. The effectiveness of the proposed methodology is demonstrated using real-world ground vibration data from an aircraft.
☆ Transmission Benefits and Cost Allocation under Ambiguity
Disputes over cost allocation can present a significant barrier to investment in shared infrastructure. While it may be desirable to allocate cost in a way that corresponds to expected benefits, investments in long-lived projects are made under conditions of substantial uncertainty. In the context of electricity transmission, uncertainty combined with the inherent complexity of power systems analysis prevents the calculation of an estimated distribution of benefits that is agreeable to all participants. To analyze aspects of the cost allocation problem, we construct a model for transmission and generation expansion planning under uncertainty, enabling the identification of transmission investments as well as the calculation of benefits for users of the network. Numerical tests confirm the potential for realized benefits at the participant level to differ significantly from ex ante estimates. Based on the model and numerical tests we discuss several issues, including 1) establishing a valid counterfactual against which to measure benefits, 2) allocating cost to new and incumbent generators vs. solely allocating to loads, 3) calculating benefits at the portfolio vs. the individual project level, 4) identifying losers in a surplus-enhancing transmission expansion, and 5) quantifying the divergence between cost allocation decisions made ex ante and benefits realized ex post.
comment: 32 pages, 7 figures, 7 tables
♻ ☆ Deep learning reduces sensor requirements for gust rejection on a small uncrewed aerial vehicle morphing wing
There is a growing need for uncrewed aerial vehicles (UAVs) to operate in cities. However, the uneven urban landscape and complex street systems cause large-scale wind gusts that challenge the safe and effective operation of UAVs. Current gust alleviation methods rely on traditional control surfaces and computationally expensive modeling to select a control action, leading to a slower response. Here, we used deep reinforcement learning to create an autonomous gust alleviation controller for a camber-morphing wing. This method reduced gust impact by 84%, directly from real-time, on-board pressure signals. Notably, we found that gust alleviation using signals from only three pressure taps was statistically indistinguishable from using six signals. This reduced-sensor fly-by-feel control opens the door to UAV missions in previously inoperable locations.
♻ ☆ Adversarial Attacks and Defenses in Automated Control Systems: A Comprehensive Benchmark
Integrating machine learning into Automated Control Systems (ACS) enhances decision-making in industrial process management. One of the limitations to the widespread adoption of these technologies in industry is the vulnerability of neural networks to adversarial attacks. This study explores the threats in deploying deep learning models for fault diagnosis in ACS using the Tennessee Eastman Process dataset. By evaluating three neural networks with different architectures, we subject them to six types of adversarial attacks and explore five different defense methods. Our results highlight the strong vulnerability of models to adversarial samples and the varying effectiveness of defense strategies. We also propose a novel protection approach by combining multiple defense methods and demonstrate it's efficacy. This research contributes several insights into securing machine learning within ACS, ensuring robust fault diagnosis in industrial processes.
♻ ☆ Bearing-Constrained Leader-Follower Formation of Single-Integrators with Disturbance Rejection: Adaptive Variable-Structure Approaches
This paper studies the problem of stabilizing a leader-follower formation specified by a set of bearing constraints and being disturbed by unknown uniformly bounded disturbance. A set of leaders are positioned at the desired positions, while each follower agent is modeled by a single integrator with disturbance of which the upper bound is unavailable for the control design. Adaptive variable-structure formation control laws using only displacements or bearing vectors are provided to stabilize the agents to a desired formation. Thanks to the adaptive mechanism, the control laws require neither the information of the bearing Laplacian nor the directions and the upper bounds of the disturbance. It is further proved that when the leaders are moving with the same bounded uniformly continuous velocity, the moving target formation can still be achieved under the proposed control laws. Simulation results are also given to support the stability analysis.
comment: 18 pages, 6 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Optimizing the Electrical Interface for Large-Scale Color-Center Quantum Processors
Quantum processors based on color centers in diamond are promising candidates for future large-scale quantum computers thanks to their flexible optical interface, (relatively) high operating temperature, and high-fidelity operation. Similar to other quantum-computing platforms, the electrical interface required to control and read out such qubits may limit both the performance of the whole system and its scalability. To address this challenge, this work analyzes the requirements of the electrical interface and investigates how to efficiently implement the electronic controller in a scalable architecture comprising a large number of identical unit cells. Among the different discussed functionalities, a specific focus is devoted to the generation of the static and dynamic magnetic fields driving the electron and nuclear spins, because of their major impact on fidelity and scalability. Following the derived requirements, different system architectures, such as a qubit frequency-multiplexing scheme, are considered to identify the most power efficient approach, especially in the presence of inhomogeneity of the qubit Larmor frequency across the processor. As a result, a non-frequency-multiplexed, 1-mm$^2$ unit-cell architecture is proposed as the optimal solution, able to address up to one electron-spin qubit and 9 nuclear-spin qubits within a 3-mW average power consumption, thus establishing the baseline for the scalable electrical interface for future large-scale color-center quantum computers.
comment: 20 pages, 10 figures; Revision 21-03-2024: Corrected affiliation, corrected typo Fig. 2, corrected DC noise PSD value (Z-field noise, X/Y-field noise) in Tab.1
♻ ☆ Input-output linearization and decoupling of mechanical control systems
In this work, we present a problem of simultaneous input-output feedback linearization and decoupling (non-interacting) for mechanical control systems with outputs. We show that the natural requirement of preserving mechanical structure of the system and of transformations imposes supplementary conditions when compared to the classical solution of the same problem for general control systems. These conditions can be expressed using objects on the configuration space only. We illustrate our results with several examples of mechanical control systems.
♻ ☆ Stabilizing reinforcement learning control: A modular framework for optimizing over all stable behavior
We propose a framework for the design of feedback controllers that combines the optimization-driven and model-free advantages of deep reinforcement learning with the stability guarantees provided by using the Youla-Kucera parameterization to define the search domain. Recent advances in behavioral systems allow us to construct a data-driven internal model; this enables an alternative realization of the Youla-Kucera parameterization based entirely on input-output exploration data. Perhaps of independent interest, we formulate and analyze the stability of such data-driven models in the presence of noise. The Youla-Kucera approach requires a stable "parameter" for controller design. For the training of reinforcement learning agents, the set of all stable linear operators is given explicitly through a matrix factorization approach. Moreover, a nonlinear extension is given using a neural network to express a parameterized set of stable operators, which enables seamless integration with standard deep learning libraries. Finally, we show how these ideas can also be applied to tune fixed-structure controllers.
comment: Postprint; 31 pages. arXiv admin note: text overlap with arXiv:2304.03422
♻ ☆ Representative Days and Hours with Piecewise Linear Transitions for Power System Planning
Electric demand and renewable power are highly variable, and the solution of a planning model relies on capturing this variability. This paper proposes a hybrid multi-area method that effectively captures both the intraday and interday chronology of real data considering extreme values, using a limited number of representative days, and time points within each day. An optimization-based representative extraction method is proposed to improve intraday chronology capturing. It ensures higher precision in preserving data chronology and extreme values than hierarchical clustering methods. The proposed method is based on a piecewise linear demand and supply representation, which reduces approximation errors compared to the traditional piecewise constant formulation. Additionally, sequentially linked day blocks with identical representatives, created through a mapping process, are employed for interday chronology capturing. To evaluate the efficiency of the proposed method, a comprehensive expansion co-planning model is developed, including transmission lines, energy storage systems, and wind farms.
♻ ☆ A Convex Parameterization of Controllers Constrained to use only Relative Measurements
The optimal controller design problem for systems equipped with sensors that measure only relative, rather than absolute, quantities is considered. This relative measurement structure is formulated as a design constraint; it is demonstrated that the resulting constrained controller design problem can be written as a convex program. Certain additional network structural constraints can be incorporated into this formulation, making it especially useful in distributed or networked settings. An illustrative example highlights the advantage of the proposed methodology over the standard formulation of the output feedback controller design problem. A numerical example is provided.
comment: 6 pages, 2 figures
♻ ☆ Markov Decision Process Design: A Framework for Integrating Strategic and Operational Decisions
We consider the problem of optimally designing a system for repeated use under uncertainty. We develop a modeling framework that integrates design and operational phases, which are represented by a mixed-integer program and discounted-cost infinite-horizon Markov decision processes, respectively. We seek to simultaneously minimize the design costs and the subsequent expected operational costs. This problem setting arises naturally in several application areas, as we illustrate through examples. We derive a bilevel mixed-integer linear programming formulation for the problem and perform a computational study to demonstrate that realistic instances can be solved numerically.
♻ ☆ Scalable Neural Dynamic Equivalence for Power Systems
Traditional grid analytics are model-based, relying strongly on accurate models of power systems, especially the dynamic models of generators, controllers, loads and other dynamic components. However, acquiring thorough power system models can be impractical in real operation due to inaccessible system parameters and privacy of consumers, which necessitate data-driven dynamic equivalencing of unknown subsystems. Learning reliable dynamic equivalent models for the external systems from SCADA and PMU data, however, is a long-standing intractable problem in power system analysis due to complicated nonlinearity and unforeseeable dynamic modes of power systems. This paper advances a practical application of neural dynamic equivalence (NeuDyE) called Driving Port NeuDyE (DP-NeuDyE), which exploits physics-informed machine learning and neural-ordinary-differential-equations (ODE-NET) to discover a dynamic equivalence of external power grids while preserving its dynamic behaviors after disturbances. The new contributions are threefold: A NeuDyE formulation to enable a continuous-time, data-driven dynamic equivalence of power systems, saving the effort and expense of acquiring inaccessible system; An introduction of a Physics-Informed NeuDyE learning (PI-NeuDyE) to actively control the closed-loop accuracy of NeuDyE; and A DP-NeuDyE to reduce the number of inputs required for the training. We conduct extensive case studies on the NPCC system to validate the generalizability and accuracy of both PI-NeuDyE and DP-NeuDyE, which span a multitude of scenarios, differing in the time required for fault clearance, the specific fault locations, and the limitations of data. Test results have demonstrated the scalability and practicality of NeuDyE, showing its potential to be used in ISO and utility control centers for online transient stability analysis and for planning purposes.
Multiagent Systems
☆ Extrapolating Solution Paths of Polynomial Homotopies towards Singularities with PHCpack and phcpy
A robust path tracker [Telen, Van Barel, Verschelde, SISC 2020] computes the radius of convergence of Newton's method, estimates the distance to the nearest path, and then applies Pad\'e approximants to predict the next point on the path. Apriori step size control is less sensitive to finely tuned tolerances than aposteriori step size control, and is therefore robust. Extrapolation methods are effective to accurately locate the singular points at the end of solution paths, as illustrated with phcpy, the scripting interface to PHCpack.
♻ ☆ CUQIpy: II. Computational uncertainty quantification for PDE-based inverse problems in Python
Inverse problems, particularly those governed by Partial Differential Equations (PDEs), are prevalent in various scientific and engineering applications, and uncertainty quantification (UQ) of solutions to these problems is essential for informed decision-making. This second part of a two-paper series builds upon the foundation set by the first part, which introduced CUQIpy, a Python software package for computational UQ in inverse problems using a Bayesian framework. In this paper, we extend CUQIpy's capabilities to solve PDE-based Bayesian inverse problems through a general framework that allows the integration of PDEs in CUQIpy, whether expressed natively or using third-party libraries such as FEniCS. CUQIpy offers concise syntax that closely matches mathematical expressions, streamlining the modeling process and enhancing the user experience. The versatility and applicability of CUQIpy to PDE-based Bayesian inverse problems are demonstrated on examples covering parabolic, elliptic and hyperbolic PDEs. This includes problems involving the heat and Poisson equations and application case studies in electrical impedance tomography and photo-acoustic tomography, showcasing the software's efficiency, consistency, and intuitive interface. This comprehensive approach to UQ in PDE-based inverse problems provides accessibility for non-experts and advanced features for experts.
comment: 38 pages, 12 figures
♻ ☆ CUQIpy: I. Computational uncertainty quantification for inverse problems in Python
This paper introduces CUQIpy, a versatile open-source Python package for computational uncertainty quantification (UQ) in inverse problems, presented as Part I of a two-part series. CUQIpy employs a Bayesian framework, integrating prior knowledge with observed data to produce posterior probability distributions that characterize the uncertainty in computed solutions to inverse problems. The package offers a high-level modeling framework with concise syntax, allowing users to easily specify their inverse problems, prior information, and statistical assumptions. CUQIpy supports a range of efficient sampling strategies and is designed to handle large-scale problems. Notably, the automatic sampler selection feature analyzes the problem structure and chooses a suitable sampler without user intervention, streamlining the process. With a selection of probability distributions, test problems, computational methods, and visualization tools, CUQIpy serves as a powerful, flexible, and adaptable tool for UQ in a wide selection of inverse problems. Part II of the series focuses on the use of CUQIpy for UQ in inverse problems with partial differential equations (PDEs).
comment: 37 pages, 11 figures
♻ ☆ A new open source framework for multiscale modeling of fibrous materials on heterogeneous supercomputers
This article presents MuMFiM, an open source application for multiscale modeling of fibrous materials on massively parallel computers. MuMFiM uses two scales to represent fibrous materials such as biological network materials (extracellular matrix, connective tissue, etc.). It is designed to make use of multiple levels of parallelism, including distributed parallelism of the macro and microscales as well as GPU accelerated data-parallelism of the microscale. Scaling results of the GPU accelerated microscale show that solving microscale problems concurrently on the GPU can lead to a 1000x speedup over the solution of a single RVE on the GPU. In addition, we show nearly optimal strong and weak scaling results of MuMFiM on up to 128 nodes of AiMOS (Rensselaer Polytechnic Institute) which is composed of IBM AC922 nodes with 6 Volta V100 GPU and 2 20 core Power 9 CPUs each. We also show how MuMFiM can be used to solve problems of interest to the broader engineering community, in particular providing an example of the facet capsule ligament (FCL) of the human spine undergoing uniaxial extension.
Databases
☆ Quantifying Semantic Query Similarity for Automated Linear SQL Grading: A Graph-based Approach
Quantifying the semantic similarity between database queries is a critical challenge with broad applications, ranging from query log analysis to automated educational assessment of SQL skills. Traditional methods often rely solely on syntactic comparisons or are limited to checking for semantic equivalence. This paper introduces a novel graph-based approach to measure the semantic dissimilarity between SQL queries. Queries are represented as nodes in an implicit graph, while the transitions between nodes are called edits, which are weighted by semantic dissimilarity. We employ shortest path algorithms to identify the lowest-cost edit sequence between two given queries, thereby defining a quantifiable measure of semantic distance. A prototype implementation of this technique has been evaluated through an empirical study, which strongly suggests that our method provides more accurate and comprehensible grading compared to existing techniques. Moreover, the results indicate that our approach comes close to the quality of manual grading, making it a robust tool for diverse database query comparison tasks.
comment: 12 pages
☆ Space-Efficient Indexes for Uncertain Strings ICDE 2024
Strings in the real world are often encoded with some level of uncertainty. In the character-level uncertainty model, an uncertain string $X$ of length $n$ on an alphabet $\Sigma$ is a sequence of $n$ probability distributions over $\Sigma$. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ is at least $\frac{1}{z}$. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has $\mathcal{O}(nz)$ size, requires $\mathcal{O}(nz)$ time and $\mathcal{O}(nz)$ space to be constructed, and answers pattern matching queries in the optimal $\mathcal{O}(m+|\text{Occ}|)$ time, where $m$ is the length of $P$ and $|\text{Occ}|$ is the total number of occurrences of $P$ in $X$. For large $n$ and (moderate) $z$ values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We propose an index of $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected size, which can be constructed using $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length $m\geq \ell$. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.
comment: Accepted to ICDE 2024. Abstract abridged to satisfy arXiv requirements
☆ Deep Learning for Trajectory Data Management and Mining: A Survey and Beyond
Trajectory computing is a pivotal domain encompassing trajectory data management and mining, garnering widespread attention due to its crucial role in various practical applications such as location services, urban traffic, and public safety. Traditional methods, focusing on simplistic spatio-temporal features, face challenges of complex calculations, limited scalability, and inadequate adaptability to real-world complexities. In this paper, we present a comprehensive review of the development and recent advances in deep learning for trajectory computing (DL4Traj). We first define trajectory data and provide a brief overview of widely-used deep learning models. Systematically, we explore deep learning applications in trajectory management (pre-processing, storage, analysis, and visualization) and mining (trajectory-related forecasting, trajectory-related recommendation, trajectory classification, travel time estimation, anomaly detection, and mobility generation). Notably, we encapsulate recent advancements in Large Language Models (LLMs) that hold the potential to augment trajectory computing. Additionally, we summarize application scenarios, public datasets, and toolkits. Finally, we outline current challenges in DL4Traj research and propose future directions. Relevant papers and open-source resources have been collated and are continuously updated at: \href{https://github.com/yoshall/Awesome-Trajectory-Computing}{DL4Traj Repo}.
comment: 25 pages, 12 figures, 5 tables
☆ Gen-T: Table Reclamation in Data Lakes
We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete or inconsistent. To do this, we define a new measure of table similarity, called error-aware instance similarity, to measure how close a reclaimed table is to a Source Table, a measure grounded in instance similarity used in data exchange. Our search covers not only SELECT-PROJECT- JOIN queries, but integration queries with unions, outerjoins, and the unary operators subsumption and complementation that have been shown to be important in data integration and fusion. Using reclamation, a data scientist can understand if any tables in a repository can be used to exactly reclaim a tuple in the Source. If not, one can understand if this is due to differences in values or to incompleteness in the data. Our solution, Gen-T, performs table discovery to retrieve a set of candidate tables from the table repository, filters these down to a set of originating tables, then integrates these tables to reclaim the Source as closely as possible. We show that our solution, while approximate, is accurate, efficient and scalable in the size of the table repository with experiments on real data lakes containing up to 15K tables, where the average number of tuples varies from small (web tables) to extremely large (open data tables) up to 1M tuples.
♻ ☆ TensorBank: Tensor Lakehouse for Foundation Model Training
Storing and streaming high dimensional data for foundation model training became a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows to directly address tensors on block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory translating relational queries and requested transformations as an instance. By making use of the HSI, irrelevant blocks can be skipped without reading them as those indices contain statistics on their content at different hierarchical resolution levels. This is an opinionated architecture powered by open standards and making heavy use of open-source technology. Although, hardened for production use using geospatial-temporal data, this architecture generalizes to other use case like computer vision, computational neuroscience, biological sequence analysis and more.
♻ ☆ Generating Explanations to Understand and Repair Embedding-based Entity Alignment ICDE 2024
Entity alignment (EA) seeks identical entities in different knowledge graphs, which is a long-standing task in the database research. Recent work leverages deep learning to embed entities in vector space and align them via nearest neighbor search. Although embedding-based EA has gained marked success in recent years, it lacks explanations for alignment decisions. In this paper, we present the first framework that can generate explanations for understanding and repairing embedding-based EA results. Given an EA pair produced by an embedding model, we first compare its neighbor entities and relations to build a matching subgraph as a local explanation. We then construct an alignment dependency graph to understand the pair from an abstract perspective. Finally, we repair the pair by resolving three types of alignment conflicts based on dependency graphs. Experiments on a variety of EA datasets demonstrate the effectiveness, generalization, and robustness of our framework in explaining and repairing embedding-based EA results.
comment: Accepted in the 40th IEEE International Conference on Data Engineering (ICDE 2024)
♻ ☆ On enforcing dyadic-type homogeneous binary function product constraints in MatBase
Homogeneous binary function products are often encountered in the sub-universes modeled by databases, from genealogical trees to sports, from education to healthcare, etc. Their properties must be discovered and enforced by the software applications managing such data to guarantee plausibility. The (Elementary) Mathematical Data Model provides 18 dyadic-type homogeneous binary function product constraint types. MatBase, an intelligent data and knowledge base management system prototype, allows database designers to simply declare them by only clicking corresponding checkboxes and automatically generates code for enforcing them. This paper describes the algorithms that MatBase uses for enforcing all these 18 homogeneous binary function product constraint types, which may also be used by developers not having access to MatBase.
comment: submitted on Dec. 7, 2023, to the Journal of Data Science and Intelligent Systems (JDSIS), on Dec. 20, 2023, to the Journal of Computational and Cognitive Engineering, both of Bon View Publishing, Singapore, on Dec. 30 to the Journal of Current Research and Studies, and on Jan. 25, 2024, to the Journal of Computer Science Research, Bilingual Publishing Group, Singapore
Emerging Technologies
☆ VL-DNA: Enhance DNA Storage Capacity with Variable Payload (Strand) Lengths
DNA storage is a promising archival data storage solution to today's big data problem. A DNA storage system encodes and stores digital data with synthetic DNA sequences and decodes DNA sequences back to digital data via sequencing. For efficient target data retrieving, existing Polymerase Chain Reaction PCR based DNA storage systems apply primers as specific identifier to tag different set of DNA strands. However, the PCR based DNA storage system suffers from primer-payload collisions, causing a significant reduction of storage capacity. This paper proposes using variable strand length, which takes advantage of the inherent payload-cutting process, to split collisions and recover primers. The executing time of our scheme is linear to the number of primer-payload collisions. The scheme serves as a post-processing method to any DNA encoding scheme. The evaluation of three state-of-the-art encoding schemes shows that the scheme can recover thousands of usable primers and improve tube capacity ranging from 18.27% to 19x.
☆ Photonic-Electronic Integrated Circuits for High-Performance Computing and AI Accelerator
In recent decades, the demand for computational power has surged, particularly with the rapid expansion of artificial intelligence (AI). As we navigate the post-Moore's law era, the limitations of traditional electrical digital computing, including process bottlenecks and power consumption issue, are propelling the search for alternative computing paradigms. Among various emerging technologies, integrated photonics stands out as a promising solution for the next generation of high-performance computing due to the inherent advantages of light, such as low latency, high bandwidth, and unique multiplexing techniques. Furthermore, the progress in photonic integrated circuits (PICs), which are equipped with abundant photoelectronic components, positions photonic-electronic integrated circuits as a viable solution for high-performance computing and as hardware AI accelerators. In this review, we survey recent advancements in both PIC-based digital and analog computing for AI, exploring the principal benefits and obstacles of implementation. Additionally, we propose a comprehensive analysis of photonic AI from the perspectives of hardware implementation, accelerator architecture, and software-hardware co-design. In the end, acknowledging the existing challenges, we underscore potential strategies for overcoming these issues and offer insights into the future drivers for optical computing.
☆ Collision Aware Data Allocation In Multi-tube DNA Storage
DNA storage is a promising archival data storage solution to today's big data problem. A DNA storage system encodes and stores digital data with synthetic DNA sequences and decodes DNA sequences back to digital data via sequencing. For efficient target data retrieving, existing Polymerase Chain Reaction (PCR) based DNA storage systems apply primers as specific identifiers to tag different sets of DNA strands. However, if a primer has collisions with any payload in the same DNA tube, the primer cannot safely serve as an identifier and must be disabled in this tube. In a DNA storage system with multiple DNA tubes, the primer-payload collisions can spread over all DNA tubes, repeatedly disable many primers, and cause a significant overall capacity reduction. This paper proposes using a collision-aware data allocation scheme to allocate data with different collisions into different tubes so that a primer banned in a tube because of primer-payload collision can be reused in other tubes. This allocation helps increase the number of usable primers over all tubes thus enhancing the overall storage capacity. The executing time of our scheme is $O(n^2)$ to the number of digital data chunks. The scheme serves as a pre-processing method for any DNA storage system. The evaluation of the state-of-the-art encoding scheme shows that the scheme can increase 20%-25% overall storage capacity.
comment: arXiv admin note: text overlap with arXiv:2403.14204
♻ ☆ Collaborative Distributed Machine Learning
Various collaborative distributed machine learning (CDML) systems, including federated learning systems and swarm learning systems, with different key traits were developed to leverage resources for development and use of machine learning (ML) models in a confidentiality-preserving way. To meet use case requirements, suitable CDML systems need to be selected. However, comparison between CDML systems regarding their suitability for use cases is often difficult. This work presents a CDML system conceptualization and CDML archetypes to support comparison of CDML systems and introduce scientific and practical audiences to the principal functioning and key traits of CDML systems.
♻ ☆ AltGraph: Redesigning Quantum Circuits Using Generative Graph Models for Efficient Optimization
Quantum circuit transformation aims to produce equivalent circuits while optimizing for various aspects such as circuit depth, gate count, and compatibility with modern Noisy Intermediate Scale Quantum (NISQ) devices. There are two techniques for circuit transformation. The first is a rule-based approach that greedily cancels out pairs of gates that equate to the identity unitary operation. Rule-based approaches are used in quantum compilers such as Qiskit, tket, and Quilc. The second is a search-based approach that tries to find an equivalent quantum circuit by exploring the quantum circuits search space. Search-based approaches typically rely on machine learning techniques such as generative models and Reinforcement Learning (RL). In this work, we propose AltGraph, a novel search-based circuit transformation approach that generates equivalent quantum circuits using existing generative graph models. We use three main graph models: DAG Variational Autoencoder (D-VAE) with two variants: Gated Recurrent Unit (GRU) and Graph Convolutional Network (GCN), and Deep Generative Model for Graphs (DeepGMG) that take a Direct Acyclic Graph (DAG) of the quantum circuit as input and output a new DAG from which we reconstruct the equivalent quantum circuit. Next, we perturb the latent space to generate equivalent quantum circuits some of which may be more compatible with the hardware coupling map and/or enable better optimization leading to reduced gate count and circuit depth. AltGraph achieves on average a 37.55% reduction in the number of gates and a 37.75% reduction in the circuit depth post-transpiling compared to the original transpiled circuit with only 0.0074 Mean Squared Error (MSE) in the density matrix.
Symbolic Computation
☆ Extrapolating Solution Paths of Polynomial Homotopies towards Singularities with PHCpack and phcpy
A robust path tracker [Telen, Van Barel, Verschelde, SISC 2020] computes the radius of convergence of Newton's method, estimates the distance to the nearest path, and then applies Pad\'e approximants to predict the next point on the path. Apriori step size control is less sensitive to finely tuned tolerances than aposteriori step size control, and is therefore robust. Extrapolation methods are effective to accurately locate the singular points at the end of solution paths, as illustrated with phcpy, the scripting interface to PHCpack.
Data Structures and Algorithms
☆ Agent-based MST Construction
{\em Minimum-weight spanning tree} (MST) is one of the fundamental and well-studied problems in distributed computing. In this paper, we initiate the study of constructing MST using mobile agents (aka robots). Suppose $n$ agents are positioned initially arbitrarily on the nodes of a connected, undirected, arbitrary, anonymous, port-labeled, weighted $n$-node, $m$-edge graph $G$ of diameter $D$ and maximum degree $\Delta$. The agents relocate themselves autonomously and compute an MST of $G$ such that exactly one agent positions on a node and tracks in its memory which of its adjacent edges belong to the MST. The objective is to minimize time and memory requirements. Following the literature, we consider the synchronous setting in which each agent performs its operations synchronously with others and hence time can be measured in rounds. We first establish a generic result: if $n$ and $\Delta$ are known a priori and memory per agent is as much as node memory in the message-passing model (of distributed computing), agents can simulate any $O(T)$-round deterministic algorithm for any problem in the message-passing model to the agent model in $O(\Delta T \log n+n\log^2n)$ rounds. As a corollary, MST can be constructed in the agent model in $O(\max\{\Delta \sqrt{n} \log n \log^*n, \Delta D \log n,n\log^2n\})$ rounds simulating the celebrated $O(\sqrt{n} \log^*n +D)$-round GKP algorithm for MST in the message-passing model. We then establish that, without knowing any graph parameter a priori, there exists a deterministic algorithm to construct MST in the agent model in $O(m+n\log n)$ rounds with $O(n \log n)$ bits memory at each agent. The presented algorithm needs to overcome highly non-trivial challenges on how to synchronize agents in computing MST as they may initially be positioned arbitrarily on the graph nodes.
comment: 26 pages
☆ How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures
The sequential semantics of many concurrent data structures, such as stacks and queues, inevitably lead to memory contention in parallel environments, thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing the parallelism at the expense of weakened semantics. Although prior research has shown that improved performance can be attained by relaxing concurrent data structure semantics, there is no one-size-fits-all relaxation that adequately addresses the varying needs of dynamic executions. In this paper, we first introduce the concept of elastic relaxation and consequently present the Lateral structure, which is an algorithmic component capable of supporting the design of elastically relaxed concurrent data structures. Using the Lateral , we design novel elastically relaxed, lock-free queues and stacks capable of reconfiguring relaxation during run time. We establish linearizability and define upper bounds for relaxation errors in our designs. Experimental evaluations show that our elastic designs hold up against state-of-the-art statically relaxed designs, while also swiftly managing trade-offs between relaxation and operational latency. We also outline how to use the Lateral to design elastically relaxed lock-free counters and deques.
♻ ☆ The Runtime of Random Local Search on the Generalized Needle Problem
In their recent work, C. Doerr and Krejca (Transactions on Evolutionary Computation, 2023) proved upper bounds on the expected runtime of the randomized local search heuristic on generalized Needle functions. Based on these upper bounds, they deduce in a not fully rigorous manner a drastic influence of the needle radius $k$ on the runtime. In this short article, we add the missing lower bound necessary to determine the influence of parameter $k$ on the runtime. To this aim, we derive an exact description of the expected runtime, which also significantly improves the upper bound given by C. Doerr and Krejca. We also describe asymptotic estimates of the expected runtime.
comment: 18 pages
♻ ☆ Towards the Characterization of Terminal Cut Functions: a Condition for Laminar Families
We study the following characterization problem. Given a set $T$ of terminals and a $(2^{|T|}-2)$-dimensional vector $\pi$ whose coordinates are indexed by proper subsets of $T$, is there a graph $G$ that contains $T$, such that for all subsets $\emptyset\subsetneq S\subsetneq T$, $\pi_S$ equals the value of the min-cut in $G$ separating $S$ from $T\setminus S$? The only known necessary conditions are submodularity and a special class of linear inequalities given by Chaudhuri, Subrahmanyam, Wagner and Zaroliagis. Our main result is a new class of linear inequalities concerning laminar families, that generalize all previous ones. Using our new class of inequalities, we can generalize Karger's approximate min-cut counting result to graphs with terminals.
♻ ☆ A Survey on Graph Problems Parameterized Above and Below Guaranteed Values
We survey the field of algorithms and complexity for graph problems parameterized above or below guaranteed values, a research area which was pioneered by Venkatesh Raman. Those problems seek, for a given graph $G$, a solution whose value is at least $g(G)+k$ or at most $g(G)-k$, where $g(G)$ is a guarantee on the value that any solution on $G$ takes. The goal is to design algorithms which find such solution in time whose complexity in $k$ is decoupled from that in the guarantee, or to rule out the existence of such algorithms by means of intractability results. We discuss a large number of algorithms and intractability results, and complement them by several open problems.
Human-Computer Interaction
☆ MotorEase: Automated Detection of Motor Impairment Accessibility Issues in Mobile App UIs ICSE 2024
Recent research has begun to examine the potential of automatically finding and fixing accessibility issues that manifest in software. However, while recent work makes important progress, it has generally been skewed toward identifying issues that affect users with certain disabilities, such as those with visual or hearing impairments. However, there are other groups of users with different types of disabilities that also need software tooling support to improve their experience. As such, this paper aims to automatically identify accessibility issues that affect users with motor-impairments. To move toward this goal, this paper introduces a novel approach, called MotorEase, capable of identifying accessibility issues in mobile app UIs that impact motor-impaired users. Motor-impaired users often have limited ability to interact with touch-based devices, and instead may make use of a switch or other assistive mechanism -- hence UIs must be designed to support both limited touch gestures and the use of assistive devices. MotorEase adapts computer vision and text processing techniques to enable a semantic understanding of app UI screens, enabling the detection of violations related to four popular, previously unexplored UI design guidelines that support motor-impaired users, including: (i) visual touch target size, (ii) expanding sections, (iii) persisting elements, and (iv) adjacent icon visual distance. We evaluate MotorEase on a newly derived benchmark, called MotorCheck, that contains 555 manually annotated examples of violations to the above accessibility guidelines, across 1599 screens collected from 70 applications via a mobile app testing tool. Our experiments illustrate that MotorEase is able to identify violations with an average accuracy of ~90%, and a false positive rate of less than 9%, outperforming baseline techniques.
comment: Accepted to ICSE 2024 Research Track, 13 pages
☆ Learning User Embeddings from Human Gaze for Personalised Saliency Prediction
Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two public saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model's ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.
☆ VCounselor: A Psychological Intervention Chat Agent Based on a Knowledge-Enhanced Large Language Model
Conversational artificial intelligence can already independently engage in brief conversations with clients with psychological problems and provide evidence-based psychological interventions. The main objective of this study is to improve the effectiveness and credibility of the large language model in psychological intervention by creating a specialized agent, the VCounselor, to address the limitations observed in popular large language models such as ChatGPT in domain applications. We achieved this goal by proposing a new affective interaction structure and knowledge-enhancement structure. In order to evaluate VCounselor, this study compared the general large language model, the fine-tuned large language model, and VCounselor's knowledge-enhanced large language model. At the same time, the general large language model and the fine-tuned large language model will also be provided with an avatar to compare them as an agent with VCounselor. The comparison results indicated that the affective interaction structure and knowledge-enhancement structure of VCounselor significantly improved the effectiveness and credibility of the psychological intervention, and VCounselor significantly provided positive tendencies for clients' emotions. The conclusion of this study strongly supports that VConselor has a significant advantage in providing psychological support to clients by being able to analyze the patient's problems with relative accuracy and provide professional-level advice that enhances support for clients.
comment: 24 pages, 6 figures
☆ The Tribal Theater Model: Social Regulation for Dynamic User Adaptation in Virtual Interactive Environments
This paper proposes a social regulation model for dynamic adaptation according to user characteristics in virtual interactive environments, namely the tribal theater model. The model focuses on organizational regulation and builds an interaction scheme with more resilient user performance by improving the subjectivity of the user. This paper discusses the sociological theoretical basis of this model and how it was migrated to an engineering implementation of a virtual interactive environment. The model defines user interactions within a field that are regulated by a matrix through the allocation of resources. To verify the effectiveness of the tribal theater model, we designed an experimental scene using a chatroom as an example. We trained the matrix as an AI model using a temporal transformer and compared it with an interaction field with different levels of control. The experimental results showed that the tribal theater model can improve users' interactive experience, enhance resilient user performance, and effectively complete environmental interaction tasks under rule-based interaction.
comment: 20 pages, 6 figures
☆ From One to Many: How Active Robot Swarm Sizes Influence Human Cognitive Processes
In robotics, understanding human interaction with autonomous systems is crucial for enhancing collaborative technologies. We focus on human-swarm interaction (HSI), exploring how differently sized groups of active robots affect operators' cognitive and perceptual reactions over different durations. We analyze the impact of different numbers of active robots within a 15-robot swarm on operators' time perception, emotional state, flow experience, and task difficulty perception. Our findings indicate that managing multiple active robots when compared to one active robot significantly alters time perception and flow experience, leading to a faster passage of time and increased flow. More active robots and extended durations cause increased emotional arousal and perceived task difficulty, highlighting the interaction between robot the number of active robots and human cognitive processes. These insights inform the creation of intuitive human-swarm interfaces and aid in developing swarm robotic systems aligned with human cognitive structures, enhancing human-robot collaboration.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Putting Our Minds Together: Iterative Exploration for Collaborative Mind Mapping
We delineate the development of a mind-mapping system designed concurrently for both VR and desktop platforms. Employing an iterative methodology with groups of users, we systematically examined and improved various facets of our system, including interactions, communication mechanisms and gamification elements, to streamline the mind-mapping process while augmenting situational awareness and promoting active engagement among collaborators. We also report our observational findings on these facets from this iterative design process.
comment: Accepted at AHs 2024
☆ USE: Dynamic User Modeling with Stateful Sequence Models
User embeddings play a crucial role in user engagement forecasting and personalized services. Recent advances in sequence modeling have sparked interest in learning user embeddings from behavioral data. Yet behavior-based user embedding learning faces the unique challenge of dynamic user modeling. As users continuously interact with the apps, user embeddings should be periodically updated to account for users' recent and long-term behavior patterns. Existing methods highly rely on stateless sequence models that lack memory of historical behavior. They have to either discard historical data and use only the most recent data or reprocess the old and new data jointly. Both cases incur substantial computational overhead. To address this limitation, we introduce User Stateful Embedding (USE). USE generates user embeddings and reflects users' evolving behaviors without the need for exhaustive reprocessing by storing previous model states and revisiting them in the future. Furthermore, we introduce a novel training objective named future W-behavior prediction to transcend the limitations of next-token prediction by forecasting a broader horizon of upcoming user behaviors. By combining it with the Same User Prediction, a contrastive learning-based objective that predicts whether different segments of behavior sequences belong to the same user, we further improve the embeddings' distinctiveness and representativeness. We conducted experiments on 8 downstream tasks using Snapchat users' behavioral logs in both static (i.e., fixed user behavior sequences) and dynamic (i.e., periodically updated user behavior sequences) settings. We demonstrate USE's superior performance over established baselines. The results underscore USE's effectiveness and efficiency in integrating historical and recent user behavior sequences into user embeddings in dynamic user modeling.
☆ Workload Estimation for Unknown Tasks: A Survey of Machine Learning Under Distribution Shift
Human-robot teams involve humans and robots collaborating to achieve tasks under various environmental conditions. Successful teaming will require robots to adapt autonomously to a human teammate's internal state. An important element of such adaptation is the ability to estimate the human teammates' workload in unknown situations. Existing workload models use machine learning to model the relationships between physiological metrics and workload; however, these methods are susceptible to individual differences and are heavily influenced by other factors. These methods cannot generalize to unknown tasks, as they rely on standard machine learning approaches that assume data consists of independent and identically distributed (IID) samples. This assumption does not necessarily hold for estimating workload for new tasks. A survey of non-IID machine learning techniques is presented, where commonly used techniques are evaluated using three criteria: portability, model complexity, and adaptability. These criteria are used to argue which techniques are most applicable for estimating workload for unknown tasks in dynamic, real-time environments.
☆ Reading Users' Minds from What They Say: An Investigation into LLM-based Empathic Mental Inference
In human-centered design, developing a comprehensive and in-depth understanding of user experiences, i.e., empathic understanding, is paramount for designing products that truly meet human needs. Nevertheless, accurately comprehending the real underlying mental states of a large human population remains a significant challenge today. This difficulty mainly arises from the trade-off between depth and scale of user experience research: gaining in-depth insights from a small group of users does not easily scale to a larger population, and vice versa. This paper investigates the use of Large Language Models (LLMs) for performing mental inference tasks, specifically inferring users' underlying goals and fundamental psychological needs (FPNs). Baseline and benchmark datasets were collected from human users and designers to develop an empathic accuracy metric for measuring the mental inference performance of LLMs. The empathic accuracy of inferring goals and FPNs of different LLMs with varied zero-shot prompt engineering techniques are experimented against that of human designers. Experimental results suggest that LLMs can infer and understand the underlying goals and FPNs of users with performance comparable to that of human designers, suggesting a promising avenue for enhancing the scalability of empathic design approaches through the integration of advanced artificial intelligence technologies. This work has the potential to significantly augment the toolkit available to designers during human-centered design, enabling the development of both large-scale and in-depth understanding of users' experiences.
comment: Submitted to IDETC-CIE2024
☆ "It's Not a Replacement:" Enabling Parent-Robot Collaboration to Support In-Home Learning Experiences of Young Children
Learning companion robots for young children are increasingly adopted in informal learning environments. Although parents play a pivotal role in their children's learning, very little is known about how parents prefer to incorporate robots into their children's learning activities. We developed prototype capabilities for a learning companion robot to deliver educational prompts and responses to parent-child pairs during reading sessions and conducted in-home user studies involving 10 families with children aged 3-5. Our data indicates that parents want to work with robots as collaborators to augment parental activities to foster children's learning, introducing the notion of parent-robot collaboration. Our findings offer an empirical understanding of the needs and challenges of parent-child interaction in informal learning scenarios and design opportunities for integrating a companion robot into these interactions. We offer insights into how robots might be designed to facilitate parent-robot collaboration, including parenting policies, collaboration patterns, and interaction paradigms.
☆ PureConnect: A Localized Social Media System to Increase Awareness and Connectedness in Environmental Justice Communities
Frequent disruptions like highway constructions are common now-a-days, often impacting environmental justice communities (communities with low socio-economic status with disproportionately high and adverse human health and environmental effects) that live nearby. Based on our interactions via focus groups with the members of four environmental justice communities impacted by a major highway construction, a common concern is a sense of uncertainty about project activities and loss of social connectedness, leading to increased stress, depression, anxiety and diminished well-being. This paper addresses this concern by developing a localized social media system called PureConnect with a goal to raise the level of awareness about the project and increase social connectedness among the community members. PureConnect has been designed using active engagement with four environmental justice communities affected by a major highway construction. It has been deployed in the real world among the members of the four environmental justice communities, and a detailed analysis of the data collected from this deployment as well as surveys show that PureConnect is potentially useful in improving community members' well-being and the members appreciate the functionalities it provides.
comment: Submitted in COMPSAC 2024
☆ HRI Curriculum for a Liberal Arts Education
In this paper, we discuss the opportunities and challenges of teaching a human-robot interaction course at an undergraduate liberal arts college. We provide a sample syllabus adapted from a previous version of a course.
comment: Presented at the Designing an Intro to HRI Course Workshop at HRI 2024 (arXiv:2403.05588)
☆ Crowdsourcing Task Traces for Service Robotics
Demonstration is an effective end-user development paradigm for teaching robots how to perform new tasks. In this paper, we posit that demonstration is useful not only as a teaching tool, but also as a way to understand and assist end-user developers in thinking about a task at hand. As a first step toward gaining this understanding, we constructed a lightweight web interface to crowdsource step-by-step instructions of common household tasks, leveraging the imaginations and past experiences of potential end-user developers. As evidence of the utility of our interface, we deployed the interface on Amazon Mechanical Turk and collected 207 task traces that span 18 different task categories. We describe our vision for how these task traces can be operationalized as task models within end-user development tools and provide a roadmap for future work.
comment: Published in the companion proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction
☆ "This is not a data problem": Algorithms and Power in Public Higher Education in Canada CHI '24
Algorithmic decision-making is increasingly being adopted across public higher education. The expansion of data-driven practices by post-secondary institutions has occurred in parallel with the adoption of New Public Management approaches by neoliberal administrations. In this study, we conduct a qualitative analysis of an in-depth ethnographic case study of data and algorithms in use at a public college in Ontario, Canada. We identify the data, algorithms, and outcomes in use at the college. We assess how the college's processes and relationships support those outcomes and the different stakeholders' perceptions of the college's data-driven systems. In addition, we find that the growing reliance on algorithmic decisions leads to increased student surveillance, exacerbation of existing inequities, and the automation of the faculty-student relationship. Finally, we identify a cycle of increased institutional power perpetuated by algorithmic decision-making, and driven by a push towards financial sustainability.
comment: In CHI '24 Proceedings of the CHI Conference on Human Factors in Computing Systems Honolulu, HI, USA
☆ BlendScape: Enabling Unified and Personalized Video-Conferencing Environments through Generative AI
Today's video-conferencing tools support a rich range of professional and social activities, but their generic, grid-based environments cannot be easily adapted to meet the varying needs of distributed collaborators. To enable end-user customization, we developed BlendScape, a system for meeting participants to compose video-conferencing environments tailored to their collaboration context by leveraging AI image generation techniques. BlendScape supports flexible representations of task spaces by blending users' physical or virtual backgrounds into unified environments and implements multimodal interaction techniques to steer the generation. Through an evaluation with 15 end-users, we investigated their customization preferences for work and social scenarios. Participants could rapidly express their design intentions with BlendScape and envisioned using the system to structure collaboration in future meetings, but experienced challenges with preventing distracting elements. We implement scenarios to demonstrate BlendScape's expressiveness in supporting distributed collaboration techniques from prior work and propose composition techniques to improve the quality of environments.
☆ Shortchanged: Uncovering and Analyzing Intimate Partner Financial Abuse in Consumer Complaints CHI '24
Digital financial services can introduce new digital-safety risks for users, particularly survivors of intimate partner financial abuse (IPFA). To offer improved support for such users, a comprehensive understanding of their support needs and the barriers they face to redress by financial institutions is essential. Drawing from a dataset of 2.7 million customer complaints, we implement a bespoke workflow that utilizes language-modeling techniques and expert human review to identify complaints describing IPFA. Our mixed-method analysis provides insight into the most common digital financial products involved in these attacks, and the barriers consumers report encountering when doing so. Our contributions are twofold; we offer the first human-labeled dataset for this overlooked harm and provide practical implications for technical practice, research, and design for better supporting and protecting survivors of IPFA.
comment: 20 pages, 9 figures, 8 tables, This paper will be published in CHI '24: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems
♻ ☆ ABScribe: Rapid Exploration & Organization of Multiple Writing Variations in Human-AI Co-Writing Tasks using Large Language Models CHI 2024
Exploring alternative ideas by rewriting text is integral to the writing process. State-of-the-art Large Language Models (LLMs) can simplify writing variation generation. However, current interfaces pose challenges for simultaneous consideration of multiple variations: creating new variations without overwriting text can be difficult, and pasting them sequentially can clutter documents, increasing workload and disrupting writers' flow. To tackle this, we present ABScribe, an interface that supports rapid, yet visually structured, exploration and organization of writing variations in human-AI co-writing tasks. With ABScribe, users can swiftly modify variations using LLM prompts, which are auto-converted into reusable buttons. Variations are stored adjacently within text fields for rapid in-place comparisons using mouse-over interactions on a popup toolbar. Our user study with 12 writers shows that ABScribe significantly reduces task workload (d = 1.20, p < 0.001), enhances user perceptions of the revision process (d = 2.41, p < 0.001) compared to a popular baseline workflow, and provides insights into how writers explore variations using LLMs.
comment: CHI 2024
♻ ☆ Understanding Nonlinear Collaboration between Human and AI Agents: A Co-design Framework for Creative Design CHI 2024
Creative design is a nonlinear process where designers generate diverse ideas in the pursuit of an open-ended goal and converge towards consensus through iterative remixing. In contrast, AI-powered design tools often employ a linear sequence of incremental and precise instructions to approximate design objectives. Such operations violate customary creative design practices and thus hinder AI agents' ability to complete creative design tasks. To explore better human-AI co-design tools, we first summarize human designers' practices through a formative study with 12 design experts. Taking graphic design as a representative scenario, we formulate a nonlinear human-AI co-design framework and develop a proof-of-concept prototype, OptiMuse. We evaluate OptiMuse and validate the nonlinear framework through a comparative study. We notice a subconscious change in people's attitudes towards AI agents, shifting from perceiving them as mere executors to regarding them as opinionated colleagues. This shift effectively fostered the exploration and reflection processes of individual designers.
comment: to be published in CHI 2024
♻ ☆ A Spatial-Constraint Model for Manipulating Static Visualizations
We propose a spatial-constraint approach for modeling spatial-based interactions and enabling interactive visualizations, which involves the manipulation of visualizations through selection, filtering, navigation, arrangement, and aggregation. We proposes a system that activates static visualizations by adding intelligent interactions, which is achieved by associating static visual objects with forces. Our force-directed technique facilitates smooth animated transitions of the visualizations between different interaction states. We showcase the effectiveness of our technique through usage scenarios that involve activating visualizations in real-world settings.
♻ ☆ ICE: Interactive 3D Game Character Editing via Dialogue
Text-driven in-game 3D character auto-customization systems eliminate the complicated process of manipulating intricate character control parameters. However, current methods are limited by their single-round generation, incapable of further editing and fine-grained modification. In this paper, we propose an Interactive Character Editing framework (ICE) to achieve a multi-round dialogue-based refinement process. In a nutshell, our ICE offers a more user-friendly way to enable players to convey creative ideas iteratively while ensuring that created characters align with the expectations of players. Specifically, we propose an Instruction Parsing Module (IPM) that utilizes large language models (LLMs) to parse multi-round dialogues into clear editing instruction prompts in each round. To reliably and swiftly modify character control parameters at a fine-grained level, we propose a Semantic-guided Low-dimension Parameter Solver (SLPS) that edits character control parameters according to prompts in a zero-shot manner. Our SLPS first localizes the character control parameters related to the fine-grained modification, and then optimizes the corresponding parameters in a low-dimension space to avoid unrealistic results. Extensive experimental results demonstrate the effectiveness of our proposed ICE for in-game character creation and the superior editing performance of ICE. Project page: https://iceedit.github.io/.
♻ ☆ PsyChat: A Client-Centric Dialogue System for Mental Health Support SC
Dialogue systems are increasingly integrated into mental health support to help clients facilitate exploration, gain insight, take action, and ultimately heal themselves. A practical and user-friendly dialogue system should be client-centric, focusing on the client's behaviors. However, existing dialogue systems publicly available for mental health support often concentrate solely on the counselor's strategies rather than the behaviors expressed by clients. This can lead to unreasonable or inappropriate counseling strategies and corresponding responses generated by the dialogue system. To address this issue, we propose PsyChat, a client-centric dialogue system that provides psychological support through online chat. The client-centric dialogue system comprises five modules: client behavior recognition, counselor strategy selection, input packer, response generator, and response selection. Both automatic and human evaluations demonstrate the effectiveness and practicality of our proposed dialogue system for real-life mental health support. Furthermore, the case study demonstrates that the dialogue system can predict the client's behaviors, select appropriate counselor strategies, and generate accurate and suitable responses.
comment: Accepted to CSCWD 2024 (27th International Conference on Computer Supported Cooperative Work in Design)
♻ ☆ Understanding the Factors Influencing Self-Managed Enterprises of Crowdworkers: A Comprehensive Review
This paper investigates the shift in crowdsourcing towards self-managed enterprises of crowdworkers (SMECs), diverging from traditional platform-controlled models. It reviews the literature to understand the foundational aspects of this shift, focusing on identifying key factors that may explain the rise of SMECs, particularly concerning power dynamics and tensions between Online Labor Platforms (OLPs) and crowdworkers. The study aims to guide future research and inform policy and platform development, emphasizing the importance of fair labor practices in this evolving landscape.
comment: 12 pages, 5 figures, ICEIS 2024 - 2024 International Conference on Enterprise Information Systems
♻ ☆ CoQuest: Exploring Research Question Co-Creation with an LLM-based Agent CHI 2024
Developing novel research questions (RQs) often requires extensive literature reviews, especially in interdisciplinary fields. To support RQ development through human-AI co-creation, we leveraged Large Language Models (LLMs) to build an LLM-based agent system named CoQuest. We conducted an experiment with 20 HCI researchers to examine the impact of two interaction designs: breadth-first and depth-first RQ generation. The findings revealed that participants perceived the breadth-first approach as more creative and trustworthy upon task completion. Conversely, during the task, participants considered the depth-first generated RQs as more creative. Additionally, we discovered that AI processing delays allowed users to reflect on multiple RQs simultaneously, leading to a higher quantity of generated RQs and an enhanced sense of control. Our work makes both theoretical and practical contributions by proposing and evaluating a mental model for human-AI co-creation of RQs. We also address potential ethical issues, such as biases and over-reliance on AI, advocating for using the system to improve human research creativity rather than automating scientific inquiry.
comment: Accepted to SIGCHI 2024
♻ ☆ An Image-based Typology for Visualization
We present and discuss the results of a qualitative analysis of visual representations from images. We labeled each image's essential stimuli, the removal of which would render a visualization uninterpretable. As a result, we derive a typology of 10 visualization types of defined groups. We describe the typology derivation process in which we engaged. The resulting typology and image analysis can serve a number of purposes: enabling researchers to study the evolution of the community and its research output over time, facilitating the categorization of visualization images for the purpose of research and teaching, allowing researchers and practitioners to identify visual design styles to further align the quantification of any visual information processor, be that a person or an algorithm observer, and it facilitates a discussion of standardization in visualization. In addition to the visualization typology from images, we provide a dataset of 6,833 tagged images and an online tool that can be used to explore and analyze the large set of labeled images. The tool and data set enable scholars to closely examine the diverse visual designs used and how they are published and communicated in our community. A pre-registration, a free copy of this paper, and all supplemental materials are available via osf.io/dxjwt.
comment: arXiv admin note: text overlap with arXiv:2209.07533
♻ ☆ Social Robots for Sleep Health: A Scoping Review
Poor sleep health is an increasingly concerning public healthcare crisis, especially when coupled with a dwindling number of health professionals qualified to combat it. However, there is a growing body of scientific literature on the use of digital technologies in supporting and sustaining individuals' healthy sleep habits. Social robots are a relatively recent technology that has been used to facilitate health care interventions and may have potential in improving sleep health outcomes, as well. Social robots' unique characteristics -- such as anthropomorphic physical embodiment or effective communication methods -- help to engage users and motivate them to comply with specific interventions, thus improving the interventions' outcomes. This scoping review aims to evaluate current scientific evidence for employing social robots in sleep health interventions, identify critical research gaps, and suggest future directions for developing and using social robots to improve people's sleep health. Our analysis of the reviewed studies found them limited due to a singular focus on the older adult population, use of small sample sizes, limited intervention durations, and other compounding factors. Nevertheless, the reviewed studies reported several positive outcomes, highlighting the potential social robots hold in this field. Although our review found limited clinical evidence for the efficacy of social robots as purveyors of sleep health interventions, it did elucidate the potential for a successful future in this domain if current limitations are addressed and more research is conducted.
Social and Information Networks
☆ NELA-PS: A Dataset of Pink Slime News Articles for the Study of Local News Ecosystems
Pink slime news outlets automatically produce low-quality, often partisan content that is framed as authentic local news. Given that local news is trusted by Americans and is increasingly shutting down due to financial distress, pink slime news outlets have the potential to exploit local information voids. Yet, there are gaps in understanding of pink slime production practices and tactics, particularly over time. Hence, to support future research in this area, we built a dataset of over 7.9M articles from 1093 pink slime sources over 2.5 years.
comment: published at ICWSM 2024 Dataset Track
☆ Enhancing Law Enforcement Training: A Gamified Approach to Detecting Terrorism Financing
Tools for fighting cyber-criminal activities using new technologies are promoted and deployed every day. However, too often, they are unnecessarily complex and hard to use, requiring deep domain and technical knowledge. These characteristics often limit the engagement of law enforcement and end-users in these technologies that, despite their potential, remain misunderstood. For this reason, in this study, we describe our experience in combining learning and training methods and the potential benefits of gamification to enhance technology transfer and increase adult learning. In fact, in this case, participants are experienced practitioners in professions/industries that are exposed to terrorism financing (such as Law Enforcement Officers, Financial Investigation Officers, private investigators, etc.) We define training activities on different levels for increasing the exchange of information about new trends and criminal modus operandi among and within law enforcement agencies, intensifying cross-border cooperation and supporting efforts to combat and prevent terrorism funding activities. On the other hand, a game (hackathon) is designed to address realistic challenges related to the dark net, crypto assets, new payment systems and dark web marketplaces that could be used for terrorist activities. The entire methodology was evaluated using quizzes, contest results, and engagement metrics. In particular, training events show about 60% of participants complete the 11-week training course, while the Hackathon results, gathered in two pilot studies (Madrid and The Hague), show increasing expertise among the participants (progression in the achieved points on average). At the same time, more than 70% of participants positively evaluate the use of the gamification approach, and more than 85% of them consider the implemented Use Cases suitable for their investigations.
☆ The Tribal Theater Model: Social Regulation for Dynamic User Adaptation in Virtual Interactive Environments
This paper proposes a social regulation model for dynamic adaptation according to user characteristics in virtual interactive environments, namely the tribal theater model. The model focuses on organizational regulation and builds an interaction scheme with more resilient user performance by improving the subjectivity of the user. This paper discusses the sociological theoretical basis of this model and how it was migrated to an engineering implementation of a virtual interactive environment. The model defines user interactions within a field that are regulated by a matrix through the allocation of resources. To verify the effectiveness of the tribal theater model, we designed an experimental scene using a chatroom as an example. We trained the matrix as an AI model using a temporal transformer and compared it with an interaction field with different levels of control. The experimental results showed that the tribal theater model can improve users' interactive experience, enhance resilient user performance, and effectively complete environmental interaction tasks under rule-based interaction.
comment: 20 pages, 6 figures
☆ Incentivizing News Consumption on Social Media Platforms Using Large Language Models and Realistic Bot Accounts
Polarization, declining trust, and wavering support for democratic norms are pressing threats to U.S. democracy. Exposure to verified and quality news may lower individual susceptibility to these threats and make citizens more resilient to misinformation, populism, and hyperpartisan rhetoric. This project examines how to enhance users' exposure to and engagement with verified and ideologically balanced news in an ecologically valid setting. We rely on a large-scale two-week long field experiment (from 1/19/2023 to 2/3/2023) on 28,457 Twitter users. We created 28 bots utilizing GPT-2 that replied to users tweeting about sports, entertainment, or lifestyle with a contextual reply containing two hardcoded elements: a URL to the topic-relevant section of quality news organization and an encouragement to follow its Twitter account. To further test differential effects by gender of the bots, treated users were randomly assigned to receive responses by bots presented as female or male. We examine whether our over-time intervention enhances the following of news media organization, the sharing and the liking of news content and the tweeting about politics and the liking of political content. We find that the treated users followed more news accounts and the users in the female bot treatment were more likely to like news content than the control. Most of these results, however, were small in magnitude and confined to the already politically interested Twitter users, as indicated by their pre-treatment tweeting about politics. These findings have implications for social media and news organizations, and also offer direction for future work on how Large Language Models and other computational interventions can effectively enhance individual on-platform engagement with quality news and public affairs.
☆ USE: Dynamic User Modeling with Stateful Sequence Models
User embeddings play a crucial role in user engagement forecasting and personalized services. Recent advances in sequence modeling have sparked interest in learning user embeddings from behavioral data. Yet behavior-based user embedding learning faces the unique challenge of dynamic user modeling. As users continuously interact with the apps, user embeddings should be periodically updated to account for users' recent and long-term behavior patterns. Existing methods highly rely on stateless sequence models that lack memory of historical behavior. They have to either discard historical data and use only the most recent data or reprocess the old and new data jointly. Both cases incur substantial computational overhead. To address this limitation, we introduce User Stateful Embedding (USE). USE generates user embeddings and reflects users' evolving behaviors without the need for exhaustive reprocessing by storing previous model states and revisiting them in the future. Furthermore, we introduce a novel training objective named future W-behavior prediction to transcend the limitations of next-token prediction by forecasting a broader horizon of upcoming user behaviors. By combining it with the Same User Prediction, a contrastive learning-based objective that predicts whether different segments of behavior sequences belong to the same user, we further improve the embeddings' distinctiveness and representativeness. We conducted experiments on 8 downstream tasks using Snapchat users' behavioral logs in both static (i.e., fixed user behavior sequences) and dynamic (i.e., periodically updated user behavior sequences) settings. We demonstrate USE's superior performance over established baselines. The results underscore USE's effectiveness and efficiency in integrating historical and recent user behavior sequences into user embeddings in dynamic user modeling.
☆ Community Needs and Assets: A Computational Analysis of Community Conversations
A community needs assessment is a tool used by non-profits and government agencies to quantify the strengths and issues of a community, allowing them to allocate their resources better. Such approaches are transitioning towards leveraging social media conversations to analyze the needs of communities and the assets already present within them. However, manual analysis of exponentially increasing social media conversations is challenging. There is a gap in the present literature in computationally analyzing how community members discuss the strengths and needs of the community. To address this gap, we introduce the task of identifying, extracting, and categorizing community needs and assets from conversational data using sophisticated natural language processing methods. To facilitate this task, we introduce the first dataset about community needs and assets consisting of 3,511 conversations from Reddit, annotated using crowdsourced workers. Using this dataset, we evaluate an utterance-level classification model compared to sentiment classification and a popular large language model (in a zero-shot setting), where we find that our model outperforms both baselines at an F1 score of 94% compared to 49% and 61% respectively. Furthermore, we observe through our study that conversations about needs have negative sentiments and emotions, while conversations about assets focus on location and entities. The dataset is available at https://github.com/towhidabsar/CommunityNeeds.
☆ Dynamic Information Dissemination Model Incorporating Non-Adjacent Node Interaction
Describing the dynamics of information dissemination within social networks poses a formidable challenge. Despite multiple endeavors aimed at addressing this issue, only a limited number of studies have effectively replicated and forecasted the evolving course of information dissemination. In this paper, we propose a novel model, DM-NAI, which not only considers the information transfer between adjacent users but also takes into account the information transfer between non-adjacent users to comprehensively depict the information dissemination process. Extensive experiments are conducted on six datasets to predict the information dissemination range and the dissemination trend of the social network. The experimental results demonstrate an average prediction accuracy range of 94.62% to 96.71%, respectively, significantly outperforming state-of-the-art solutions. This finding illustrates that considering information transmission between non-adjacent users helps DM-NAI achieve more accurate information dissemination predictions.
comment: arXiv admin note: substantial text overlap with arXiv:2403.06385
☆ What makes a small-world network? Leveraging machine learning for the robust prediction and classification of networks
The ability to simulate realistic networks based on empirical data is an important task across scientific disciplines, from epidemiology to computer science. Often simulation approaches involve selecting a suitable network generative model such as Erd\"os-R\'enyi or small-world. However, few tools are available to quantify if a particular generative model is suitable for capturing a given network structure or organization. We utilize advances in interpretable machine learning to classify simulated networks by our generative models based on various network attributes, using both primary features and their interactions. Our study underscores the significance of specific network features and their interactions in distinguishing generative models, comprehending complex network structures, and forming real-world networks
☆ PureConnect: A Localized Social Media System to Increase Awareness and Connectedness in Environmental Justice Communities
Frequent disruptions like highway constructions are common now-a-days, often impacting environmental justice communities (communities with low socio-economic status with disproportionately high and adverse human health and environmental effects) that live nearby. Based on our interactions via focus groups with the members of four environmental justice communities impacted by a major highway construction, a common concern is a sense of uncertainty about project activities and loss of social connectedness, leading to increased stress, depression, anxiety and diminished well-being. This paper addresses this concern by developing a localized social media system called PureConnect with a goal to raise the level of awareness about the project and increase social connectedness among the community members. PureConnect has been designed using active engagement with four environmental justice communities affected by a major highway construction. It has been deployed in the real world among the members of the four environmental justice communities, and a detailed analysis of the data collected from this deployment as well as surveys show that PureConnect is potentially useful in improving community members' well-being and the members appreciate the functionalities it provides.
comment: Submitted in COMPSAC 2024
☆ Mathematical model of information bubbles on networks
The main goal of this paper to introduce a new model of evolvement of narratives (common opinions, information bubble) on networks. Our main tools come from invariant mean theory and graph theory. The case, when the root set of the network (influencers, news agencies, etc.) is ergodic is fully discussed. The other possibility, when the root contains more than one component is partially discussed and it could be a motivation for further research.
☆ Spatial-Temporal Graph Representation Learning for Tactical Networks Future State Prediction
Resource allocation in tactical ad-hoc networks presents unique challenges due to their dynamic and multi-hop nature. Accurate prediction of future network connectivity is essential for effective resource allocation in such environments. In this paper, we introduce the Spatial-Temporal Graph Encoder-Decoder (STGED) framework for Tactical Communication Networks that leverages both spatial and temporal features of network states to learn latent tactical behaviors effectively. STGED hierarchically utilizes graph-based attention mechanism to spatially encode a series of communication network states, leverages a recurrent neural network to temporally encode the evolution of states, and a fully-connected feed-forward network to decode the connectivity in the future state. Through extensive experiments, we demonstrate that STGED consistently outperforms baseline models by large margins across different time-steps input, achieving an accuracy of up to 99.2\% for the future state prediction task of tactical communication networks.
☆ Graph Neural Network for Crawling Target Nodes in Social Networks
Social networks crawling is in the focus of active research the last years. One of the challenging task is to collect target nodes in an initially unknown graph given a budget of crawling steps. Predicting a node property based on its partially known neighbourhood is at the heart of a successful crawler. In this paper we adopt graph neural networks for this purpose and show they are competitive to traditional classifiers and are better for individual cases. Additionally we suggest a training sample boosting technique, which helps to diversify the training set at early stages of crawling and thus improves the predictor quality. The experimental study on three types of target set topology indicates GNN based approach has a potential in crawling task, especially in the case of distributed target nodes.
♻ ☆ FairSNA: Algorithmic Fairness in Social Network Analysis
In recent years, designing fairness-aware methods has received much attention in various domains, including machine learning, natural language processing, and information retrieval. However, understanding structural bias and inequalities in social networks and designing fairness-aware methods for various research problems in social network analysis (SNA) have not received much attention. In this work, we highlight how the structural bias of social networks impacts the fairness of different SNA methods. We further discuss fairness aspects that should be considered while proposing network structure-based solutions for different SNA problems, such as link prediction, influence maximization, centrality ranking, and community detection. This paper clearly highlights that very few works have considered fairness and bias while proposing solutions; even these works are mainly focused on some research topics, such as link prediction, influence maximization, and PageRank. However, fairness has not yet been addressed for other research topics, such as influence blocking and community detection. We review state-of-the-art for different research topics in SNA, including the considered fairness constraints, their limitations, and our vision. This paper also covers evaluation metrics, available datasets, and synthetic network generating models used in such studies. Finally, we highlight various open research directions that require researchers' attention to bridge the gap between fairness and SNA.
♻ ☆ NetInfoF Framework: Measuring and Exploiting Network Usable Information ICLR 2024
Given a node-attributed graph, and a graph task (link prediction or node classification), can we tell if a graph neural network (GNN) will perform well? More specifically, do the graph structure and the node features carry enough usable information for the task? Our goals are (1) to develop a fast tool to measure how much information is in the graph structure and in the node features, and (2) to exploit the information to solve the task, if there is enough. We propose NetInfoF, a framework including NetInfoF_Probe and NetInfoF_Act, for the measurement and the exploitation of network usable information (NUI), respectively. Given a graph data, NetInfoF_Probe measures NUI without any model training, and NetInfoF_Act solves link prediction and node classification, while two modules share the same backbone. In summary, NetInfoF has following notable advantages: (a) General, handling both link prediction and node classification; (b) Principled, with theoretical guarantee and closed-form solution; (c) Effective, thanks to the proposed adjustment to node similarity; (d) Scalable, scaling linearly with the input size. In our carefully designed synthetic datasets, NetInfoF correctly identifies the ground truth of NUI and is the only method being robust to all graph scenarios. Applied on real-world datasets, NetInfoF wins in 11 out of 12 times on link prediction compared to general GNN baselines.
comment: Accepted to ICLR 2024 (Spotlight)
♻ ☆ Sparsification of the regularized magnetic Laplacian with multi-type spanning forests
In this paper, we consider a ${\rm U}(1)$-connection graph, that is, a graph where each oriented edge is endowed with a unit modulus complex number that is conjugated under orientation flip. A natural replacement for the combinatorial Laplacian is then the magnetic Laplacian, an Hermitian matrix that includes information about the graph's connection. Magnetic Laplacians appear, e.g., in the problem of angular synchronization. In the context of large and dense graphs, we study here sparsifiers of the magnetic Laplacian $\Delta$, i.e., spectral approximations based on subgraphs with few edges. Our approach relies on sampling multi-type spanning forests (MTSFs) using a custom determinantal point process, a probability distribution over edges that favours diversity. In a word, an MTSF is a spanning subgraph whose connected components are either trees or cycle-rooted trees. The latter partially capture the angular inconsistencies of the connection graph, and thus provide a way to compress the information contained in the connection. Interestingly, when the connection graph has weakly inconsistent cycles, samples from the determinantal point process under consideration can be obtained \`a la Wilson, using a random walk with cycle popping. We provide statistical guarantees for a choice of natural estimators of the connection Laplacian, and investigate two practical applications of our sparsifiers: ranking with angular synchronization and graph-based semi-supervised learning. From a statistical perspective, a side result of this paper of independent interest is a matrix Chernoff bound with intrinsic dimension, which allows considering the influence of a regularization -- of the form $\Delta + q \mathbb{I}$ with $q>0$ -- on sparsification guarantees.
comment: 51 pages, 15 figures. Improved presentation of the theoretical results and simulations of larger scale
♻ ☆ Analyzing User Engagement with TikTok's Short Format Video Recommendations using Data Donations CHI
Short-format videos have exploded on platforms like TikTok, Instagram, and YouTube. Despite this, the research community lacks large-scale empirical studies into how people engage with short-format videos and the role of recommendation systems that offer endless streams of such content. In this work, we analyze user engagement on TikTok using data we collect via a data donation system that allows TikTok users to donate their data. We recruited 347 TikTok users and collected 9.2M TikTok video recommendations they received. By analyzing user engagement, we find that the average daily usage time increases over the users' lifetime while the user attention remains stable at around 45%. We also find that users like more videos uploaded by people they follow than those recommended by people they do not follow. Our study offers valuable insights into how users engage with short-format videos on TikTok and lessons learned from designing a data donation system.
comment: CHI Conference on Human Factors in Computing Systems 2024 (CHI '24)
♻ ☆ Machine-Made Media: Monitoring the Mobilization of Machine-Generated Articles on Misinformation and Mainstream News Websites
As large language models (LLMs) like ChatGPT have gained traction, an increasing number of news websites have begun utilizing them to generate articles. However, not only can these language models produce factually inaccurate articles on reputable websites but disreputable news sites can utilize LLMs to mass produce misinformation. To begin to understand this phenomenon, we present one of the first large-scale studies of the prevalence of synthetic articles within online news media. To do this, we train a DeBERTa-based synthetic news detector and classify over 15.46 million articles from 3,074 misinformation and mainstream news websites. We find that between January 1, 2022, and May 1, 2023, the relative number of synthetic news articles increased by 57.3% on mainstream websites while increasing by 474% on misinformation sites. We find that this increase is largely driven by smaller less popular websites. Analyzing the impact of the release of ChatGPT using an interrupted-time-series, we show that while its release resulted in a marked increase in synthetic articles on small sites as well as misinformation news websites, there was not a corresponding increase on large mainstream news websites.
comment: Accepted to ICWSM 2024
♻ ☆ PAGE: Prototype-Based Model-Level Explanations for Graph Neural Networks AAAI-22
Aside from graph neural networks (GNNs) attracting significant attention as a powerful framework revolutionizing graph representation learning, there has been an increasing demand for explaining GNN models. Although various explanation methods for GNNs have been developed, most studies have focused on instance-level explanations, which produce explanations tailored to a given graph instance. In our study, we propose Prototype-bAsed GNN-Explainer (PAGE), a novel model-level GNN explanation method that explains what the underlying GNN model has learned for graph classification by discovering human-interpretable prototype graphs. Our method produces explanations for a given class, thus being capable of offering more concise and comprehensive explanations than those of instance-level explanations. First, PAGE selects embeddings of class-discriminative input graphs on the graph-level embedding space after clustering them. Then, PAGE discovers a common subgraph pattern by iteratively searching for high matching node tuples using node-level embeddings via a prototype scoring function, thereby yielding a prototype graph as our explanation. Using six graph classification datasets, we demonstrate that PAGE qualitatively and quantitatively outperforms the state-of-the-art model-level explanation method. We also carry out systematic experimental studies by demonstrating the relationship between PAGE and instance-level explanation methods, the robustness of PAGE to input data scarce environments, and the computational efficiency of the proposed prototype scoring function in PAGE.
comment: 18 pages, 12 figures, 5 tables; to appear in the IEEE Transactions on Pattern Analysis and Machine Intelligence (Please cite our journal version that will appear in an upcoming issue. Its two-page extended summary was presented in the AAAI-22 Student Abstract and Poster Program.)
♻ ☆ Stochastic gradient descent-based inference for dynamic network models with attractors
In Coevolving Latent Space Networks with Attractors (CLSNA) models, nodes in a latent space represent social actors, and edges indicate their dynamic interactions. Attractors are added at the latent level to capture the notion of attractive and repulsive forces between nodes, borrowing from dynamical systems theory. However, CLSNA reliance on MCMC estimation makes scaling difficult, and the requirement for nodes to be present throughout the study period limit practical applications. We address these issues by (i) introducing a Stochastic gradient descent (SGD) parameter estimation method, (ii) developing a novel approach for uncertainty quantification using SGD, and (iii) extending the model to allow nodes to join and leave over time. Simulation results show that our extensions result in little loss of accuracy compared to MCMC, but can scale to much larger networks. We apply our approach to the longitudinal social networks of members of US Congress on the social media platform X. Accounting for node dynamics overcomes selection bias in the network and uncovers uniquely and increasingly repulsive forces within the Republican Party.
♻ ☆ Hub-aware Random Walk Graph Embedding Methods for Classification
In the last two decades we are witnessing a huge increase of valuable big data structured in the form of graphs or networks. To apply traditional machine learning and data analytic techniques to such data it is necessary to transform graphs into vector-based representations that preserve the most essential structural properties of graphs. For this purpose, a large number of graph embedding methods have been proposed in the literature. Most of them produce general-purpose embeddings suitable for a variety of applications such as node clustering, node classification, graph visualisation and link prediction. In this paper, we propose two novel graph embedding algorithms based on random walks that are specifically designed for the node classification problem. Random walk sampling strategies of the proposed algorithms have been designed to pay special attention to hubs -- high-degree nodes that have the most critical role for the overall connectedness in large-scale graphs. The proposed methods are experimentally evaluated by analyzing the classification performance of three classification algorithms trained on embeddings of real-world networks. The obtained results indicate that our methods considerably improve the predictive power of examined classifiers compared to currently the most popular random walk method for generating general-purpose graph embeddings (node2vec).
comment: Submitted to journal for possible publication
♻ ☆ Moral Judgments in Narratives on Reddit: Investigating Moral Sparks via Social Commonsense and Linguistic Signals
Machine ethics ensures ethical conduct in Artificial Intelligence (AI) models and agents. Examining real-life applications benefit learning practical ethics in many situations, offering valuable data to grasp the complexities of human ethics in diverse contexts. In this paper, we examine social media platforms for understanding real-life ethical scenarios and human moral judgments. We examine posts from a popular Reddit subreddit (i.e., a subcommunity) called r/AmITheAsshole, where authors and commenters share their moral judgments on who is blameworthy. We employ computational techniques to investigate the underlying reasoning influencing moral judgments. We focus on excerpts-which we term moral sparks-from original posts that commenters include to indicate what motivates their judgments. To this end, we examine how (1) events activating social commonsense and (2) linguistic signals affect moral sparks assignment and their subsequent judgments. By examining over 24 672 posts and 175988 comments, we find that event-related negative character traits (e.g., immature and rude) attract attention and stimulate blame, implying a dependent relationship between character traits and moral values. Specially, we focus on causal graph involving events (c-events) that activate social commonsense. We observe that c-events are perceived with varying levels of informativeness, influencing moral spark and judgment assignment in distinct ways. This observation is reinforced by examining linguistic features describing semantically similar c-events. Moreover, language influencing commenters' cognitive processes enhances the probability of an excerpt becoming a moral spark, while factual and concrete descriptions tend to inhibit this effect.
Distributed, Parallel, and Cluster Computing
☆ Agent-based MST Construction
{\em Minimum-weight spanning tree} (MST) is one of the fundamental and well-studied problems in distributed computing. In this paper, we initiate the study of constructing MST using mobile agents (aka robots). Suppose $n$ agents are positioned initially arbitrarily on the nodes of a connected, undirected, arbitrary, anonymous, port-labeled, weighted $n$-node, $m$-edge graph $G$ of diameter $D$ and maximum degree $\Delta$. The agents relocate themselves autonomously and compute an MST of $G$ such that exactly one agent positions on a node and tracks in its memory which of its adjacent edges belong to the MST. The objective is to minimize time and memory requirements. Following the literature, we consider the synchronous setting in which each agent performs its operations synchronously with others and hence time can be measured in rounds. We first establish a generic result: if $n$ and $\Delta$ are known a priori and memory per agent is as much as node memory in the message-passing model (of distributed computing), agents can simulate any $O(T)$-round deterministic algorithm for any problem in the message-passing model to the agent model in $O(\Delta T \log n+n\log^2n)$ rounds. As a corollary, MST can be constructed in the agent model in $O(\max\{\Delta \sqrt{n} \log n \log^*n, \Delta D \log n,n\log^2n\})$ rounds simulating the celebrated $O(\sqrt{n} \log^*n +D)$-round GKP algorithm for MST in the message-passing model. We then establish that, without knowing any graph parameter a priori, there exists a deterministic algorithm to construct MST in the agent model in $O(m+n\log n)$ rounds with $O(n \log n)$ bits memory at each agent. The presented algorithm needs to overcome highly non-trivial challenges on how to synchronize agents in computing MST as they may initially be positioned arbitrarily on the graph nodes.
comment: 26 pages
☆ How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures
The sequential semantics of many concurrent data structures, such as stacks and queues, inevitably lead to memory contention in parallel environments, thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing the parallelism at the expense of weakened semantics. Although prior research has shown that improved performance can be attained by relaxing concurrent data structure semantics, there is no one-size-fits-all relaxation that adequately addresses the varying needs of dynamic executions. In this paper, we first introduce the concept of elastic relaxation and consequently present the Lateral structure, which is an algorithmic component capable of supporting the design of elastically relaxed concurrent data structures. Using the Lateral , we design novel elastically relaxed, lock-free queues and stacks capable of reconfiguring relaxation during run time. We establish linearizability and define upper bounds for relaxation errors in our designs. Experimental evaluations show that our elastic designs hold up against state-of-the-art statically relaxed designs, while also swiftly managing trade-offs between relaxation and operational latency. We also outline how to use the Lateral to design elastically relaxed lock-free counters and deques.
☆ CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows
Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the moment of writing, virtually all stream processors that guarantee exactly-once processing implement a variant of Apache Flink's coordinated checkpoints - an extension of the original Chandy-Lamport checkpoints from 1985. However, the reasons behind this prevalence of the coordinated approach remain anecdotal, as reported by practitioners of the stream processing community. At the same time, common checkpointing approaches, such as the uncoordinated and the communication-induced ones, remain largely unexplored. This paper is the first to address this gap by i) shedding light on why practitioners have favored the coordinated approach and ii) by investigating whether there are viable alternatives. To this end, we implement three checkpointing approaches that we surveyed and adapted for the distinct needs of streaming dataflows. Our analysis shows that the coordinated approach outperforms the uncoordinated and communication-induced protocols under uniformly distributed workloads. To our surprise, however, the uncoordinated approach is not only competitive to the coordinated one in uniformly distributed workloads, but it also outperforms the coordinated approach in skewed workloads. We conclude that rather than blindly employing coordinated checkpointing, research should focus on optimizing the very promising uncoordinated approach, as it can address issues with skew and support prevalent cyclic queries. We believe that our findings can trigger further research into checkpointing mechanisms.
☆ Dynamic Resource Allocation for Virtual Machine Migration Optimization using Machine Learning
The paragraph is grammatically correct and logically coherent. It discusses the importance of mobile terminal cloud computing migration technology in meeting the demands of evolving computer and cloud computing technologies. It emphasizes the need for efficient data access and storage, as well as the utilization of cloud computing migration technology to prevent additional time delays. The paragraph also highlights the contributions of cloud computing migration technology to expanding cloud computing services. Additionally, it acknowledges the role of virtualization as a fundamental capability of cloud computing while emphasizing that cloud computing and virtualization are not inherently interconnected. Finally, it introduces machine learning-based virtual machine migration optimization and dynamic resource allocation as a critical research direction in cloud computing, citing the limitations of static rules or manual settings in traditional cloud computing environments. Overall, the paragraph effectively communicates the importance of machine learning technology in addressing resource allocation and virtual machine migration challenges in cloud computing.
☆ Adaptive time step selection for Spectral Deferred Corrections
Spectral Deferred Corrections (SDC) is an iterative method for the numerical solution of ordinary differential equations. It works by refining the numerical solution for an initial value problem by approximately solving differential equations for the error, and can be interpreted as a preconditioned fixed-point iteration for solving the fully implicit collocation problem. We adopt techniques from embedded Runge-Kutta Methods (RKM) to SDC in order to provide a mechanism for adaptive time step size selection and thus increase computational efficiency of SDC. We propose two SDC-specific estimates of the local error that are generic and require only minimal problem specific tuning. We demonstrate a gain in efficiency over standard SDC with fixed step size, compare efficiency favorably against state-of-the-art adaptive RKM and show that due to its iterative nature, adaptive SDC can cope efficiently with silent data corruption.
comment: 34 pages including references, 12 figures. Submitted to Springer Numerical Algorithms
☆ Optimal Fixed Priority Scheduling in Multi-Stage Multi-Resource Distributed Real-Time Systems DATE
This work studies fixed priority (FP) scheduling of real-time jobs with end-to-end deadlines in a distributed system. Specifically, given a multi-stage pipeline with multiple heterogeneous resources of the same type at each stage, the problem is to assign priorities to a set of real-time jobs with different release times to access a resource at each stage of the pipeline subject to the end-to-end deadline constraints. Note, in such a system, jobs may compete with different sets of jobs at different stages of the pipeline depending on the job-to-resource mapping. To this end, following are the two major contributions of this work. We show that an OPA-compatible schedulability test based on the delay composition algebra can be constructed, which we then use with an optimal priority assignment algorithm to compute a priority ordering. Further, we establish the versatility of pairwise priority assignment in such a multi-stage multi-resource system, compared to a total priority ordering. In particular, we show that a pairwise priority assignment may be feasible even if a priority ordering does not exist. We propose an integer linear programming formulation and a scalable heuristic to compute a pairwise priority assignment. We also show through simulation experiments that the proposed approaches can be used for the holistic scheduling of real-time jobs in edge computing systems.
comment: Accepted in DATE (Design, Automation and Test in Europe Conference) 2024
☆ Causal Graph Dynamics and Kan Extensions
On the one side, the formalism of Global Transformations comes with the claim of capturing any transformation of space that is local, synchronous and deterministic.The claim has been proven for different classes of models such as mesh refinements from computer graphics, Lindenmayer systems from morphogenesis modeling and cellular automata from biological, physical and parallel computation modeling.The Global Transformation formalism achieves this by using category theory for its genericity, and more precisely the notion of Kan extension to determine the global behaviors based on the local ones.On the other side, Causal Graph Dynamics describe the transformation of port graphs in a synchronous and deterministic way and has not yet being tackled.In this paper, we show the precise sense in which the claim of Global Transformations holds for them as well.This is done by showing different ways in which they can be expressed as Kan extensions, each of them highlighting different features of Causal Graph Dynamics.Along the way, this work uncovers the interesting class of Monotonic Causal Graph Dynamics and their universality among General Causal Graph Dynamics.
☆ Regent based parallel meshfree LSKUM solver for heterogenous HPC platforms
Regent is an implicitly parallel programming language that allows the development of a single codebase for heterogeneous platforms targeting CPUs and GPUs. This paper presents the development of a parallel meshfree solver in Regent for two-dimensional inviscid compressible flows. The meshfree solver is based on the least squares kinetic upwind method. Example codes are presented to show the difference between the Regent and CUDA-C implementations of the meshfree solver on a GPU node. For CPU parallel computations, details are presented on how the data communication and synchronisation are handled by Regent and Fortran+MPI codes. The Regent solver is verified by applying it to the standard test cases for inviscid flows. Benchmark simulations are performed on coarse to very fine point distributions to assess the solver's performance. The computational efficiency of the Regent solver on an A100 GPU is compared with an equivalent meshfree solver written in CUDA-C. The codes are then profiled to investigate the differences in their performance. The performance of the Regent solver on CPU cores is compared with an equivalent explicitly parallel Fortran meshfree solver based on MPI. Scalability results are shown to offer insights into performance.
☆ Decentralized Federated Learning: Model Update Tracking Under Imperfect Information Sharing
A novel Decentralized Noisy Model Update Tracking Federated Learning algorithm (FedNMUT) is proposed, which is tailored to function efficiently in the presence of noisy communication channels that reflect imperfect information exchange. This algorithm uses gradient tracking to minimize the impact of data heterogeneity while minimizing communication overhead. The proposed algorithm incorporates noise into its parameters to mimic the conditions of noisy communication channels, thereby enabling consensus among clients through a communication graph topology in such challenging environments. FedNMUT prioritizes parameter sharing and noise incorporation to increase the resilience of decentralized learning systems against noisy communications. Through theoretical and empirical validation, it is demonstrated that the performance of FedNMUT is superior compared to the existing state-of-the-art methods and conventional parameter-mixing approaches in dealing with imperfect information sharing. This proves the capability of the proposed algorithm to counteract the negative effects of communication noise in a decentralized learning framework.
comment: arXiv admin note: substantial text overlap with arXiv:2303.10695
☆ Federated reinforcement learning for robot motion planning with zero-shot generalization
This paper considers the problem of learning a control policy for robot motion planning with zero-shot generalization, i.e., no data collection and policy adaptation is needed when the learned policy is deployed in new environments. We develop a federated reinforcement learning framework that enables collaborative learning of multiple learners and a central server, i.e., the Cloud, without sharing their raw data. In each iteration, each learner uploads its local control policy and the corresponding estimated normalized arrival time to the Cloud, which then computes the global optimum among the learners and broadcasts the optimal policy to the learners. Each learner then selects between its local control policy and that from the Cloud for next iteration. The proposed framework leverages on the derived zero-shot generalization guarantees on arrival time and safety. Theoretical guarantees on almost-sure convergence, almost consensus, Pareto improvement and optimality gap are also provided. Monte Carlo simulation is conducted to evaluate the proposed framework.
☆ Automated Calibration of Parallel and Distributed Computing Simulators: A Case Study
Many parallel and distributed computing research results are obtained in simulation, using simulators that mimic real-world executions on some target system. Each such simulator is configured by picking values for parameters that define the behavior of the underlying simulation models it implements. The main concern for a simulator is accuracy: simulated behaviors should be as close as possible to those observed in the real-world target system. This requires that values for each of the simulator's parameters be carefully picked, or "calibrated," based on ground-truth real-world executions. Examining the current state of the art shows that simulator calibration, at least in the field of parallel and distributed computing, is often undocumented (and thus perhaps often not performed) and, when documented, is described as a labor-intensive, manual process. In this work we evaluate the benefit of automating simulation calibration using simple algorithms. Specifically, we use a real-world case study from the field of High Energy Physics and compare automated calibration to calibration performed by a domain scientist. Our main finding is that automated calibration is on par with or significantly outperforms the calibration performed by the domain scientist. Furthermore, automated calibration makes it straightforward to operate desirable trade-offs between simulation accuracy and simulation speed.
comment: To appear in Proc. of the 25th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2024)
♻ ☆ Accelerating Biclique Counting on GPU ICDE24
Counting (p,q)-bicliques in bipartite graphs poses a foundational challenge with broad applications, from densest subgraph discovery in algorithmic research to personalized content recommendation in practical scenarios. Despite its significance, current leading (p,q)-biclique counting algorithms fall short, particularly when faced with larger graph sizes and clique scales. Fortunately, the problem's inherent structure, allowing for the independent counting of each biclique starting from every vertex, combined with a substantial set intersections, makes it highly amenable to parallelization. Recent successes in GPU-accelerated algorithms across various domains motivate our exploration into harnessing the parallelism power of GPUs to efficiently address the (p,q)-biclique counting challenge. We introduce GBC (GPU-based Biclique Counting), a novel approach designed to enable efficient and scalable (p,q)-biclique counting on GPUs. To address major bottleneck arising from redundant comparisons in set intersections (occupying an average of 90% of the runtime), we introduce a novel data structure that hashes adjacency lists into truncated bitmaps to enable efficient set intersection on GPUs via bit-wise AND operations. Our innovative hybrid DFS-BFS exploration strategy further enhances thread utilization and effectively manages memory constraints. A composite load balancing strategy, integrating pre-runtime and runtime workload allocation, ensures equitable distribution among threads. Additionally, we employ vertex reordering and graph partitioning strategies for improved compactness and scalability. Experimental evaluations on eight real-life and two synthetic datasets demonstrate that GBC outperforms state-of-the-art algorithms by a substantial margin. In particular, GBC achieves an average speedup of 497.8x, with the largest instance achieving a remarkable 1217.7x speedup when p = q = 8.
comment: This paper has been accepted by ICDE24
♻ ☆ The Topology of Local Computing in Networks
Modeling distributed computing in a way enabling the use of formal methods is a challenge that has been approached from different angles, among which two techniques emerged at the turn of the century: protocol complexes, and directed algebraic topology. In both cases, the considered computational model generally assumes communication via shared objects, typically a shared memory consisting of a collection of read-write registers. Our paper is concerned with network computing, where the processes are located at the nodes of a network, and communicate by exchanging messages along the edges of that network. Applying the topological approach for verification in network computing is a considerable challenge, mainly because the presence of identifiers assigned to the nodes yields protocol complexes whose size grows exponentially with the size of the underlying network. However, many of the problems studied in this context are of local nature, and their definitions do not depend on the identifiers or on the size of the network. We leverage this independence in order to meet the above challenge, and present $\textit{local}$ protocol complexes, whose sizes do not depend on the size of the network. As an application of the design of "compact" protocol complexes, we reformulate the celebrated lower bound of $\Omega(\log^*n)$ rounds for 3-coloring the $n$-node ring, in the algebraic topology framework.
♻ ☆ Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication
We propose a novel approach to iterated sparse matrix dense matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. In cases where matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. An approach based on dense matrix multiplication algorithms leads to suboptimal scalability and fails to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.
♻ ☆ Comprehensive Evaluation of GNN Training Systems: A Data Management Perspective VLDB 2024
Many Graph Neural Network (GNN) training systems have emerged recently to support efficient GNN training. Since GNNs embody complex data dependencies between training samples, the training of GNNs should address distinct challenges different from DNN training in data management, such as data partitioning, batch preparation for mini-batch training, and data transferring between CPUs and GPUs. These factors, which take up a large proportion of training time, make data management in GNN training more significant. This paper reviews GNN training from a data management perspective and provides a comprehensive analysis and evaluation of the representative approaches. We conduct extensive experiments on various benchmark datasets and show many interesting and valuable results. We also provide some practical tips learned from these experiments, which are helpful for designing GNN training systems in the future.
comment: 12 pages, 18 figures. (Accepted by VLDB 2024)
Machine Learning
☆ On Pretraining Data Diversity for Self-Supervised Learning
We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days. Code and trained models will be available at https://github.com/hammoudhasan/DiversitySSL .
comment: Under review
☆ Editing Massive Concepts in Text-to-Image Diffusion Models
Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization for each individual concept with dual self-distillation from text alignment loss and diffusion noise prediction loss. The second stage conducts massive concept editing with multi-layer, closed form model editing. We further propose a comprehensive benchmark, named ImageNet Concept Editing Benchmark (ICEB), for evaluating massive concept editing for T2I models with two subtasks, free-form prompts, massive concept categories, and extensive evaluation metrics. Extensive experiments conducted on our proposed benchmark and previous benchmarks demonstrate the superior scalability of EMCID for editing up to 1,000 concepts, providing a practical approach for fast adjustment and re-deployment of T2I diffusion models in real-world applications.
comment: Project page: https://silentview.github.io/EMCID/ . Code: https://github.com/SilentView/EMCID
☆ RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noise image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines with an increase in category numbers, primarily due to growing complexity and constraints of limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and the 2 object detection datasets under the zero-shot recognition setting.
comment: Project: https://github.com/Liuziyu77/RAR
☆ Learning from Models and Data for Visual Grounding
We introduce SynGround, a novel framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. The knowledge transfer from the models initiates the generation of image descriptions through an image description generator. These descriptions serve dual purposes: they act as prompts for synthesizing images through a text-to-image generator, and as queries for synthesizing text, from which phrases are extracted using a large language model. Finally, we leverage an open-vocabulary object detector to generate synthetic bounding boxes for the synthetic images and texts. We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention consistency objective that aligns region annotations with gradient-based model explanations. The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model. Particularly, SynGround improves the pointing game accuracy of ALBEF on the Flickr30k dataset from 79.38% to 87.26%, and on RefCOCO+ Test A from 69.35% to 79.06% and on RefCOCO+ Test B from 53.77% to 63.67%.
comment: Project Page: https://catherine-r-he.github.io/SynGround/
☆ ZigMa: Zigzag Mamba Diffusion Model
The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$ and UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be released at https://taohu.me/zigma/
comment: Project Page: https://taohu.me/zigma/
☆ Hierarchical NeuroSymbolic Approach for Action Quality Assessment
Action quality assessment (AQA) applies computer vision to quantitatively assess the performance or execution of a human action. Current AQA approaches are end-to-end neural models, which lack transparency and tend to be biased because they are trained on subjective human judgements as ground-truth. To address these issues, we introduce a neuro-symbolic paradigm for AQA, which uses neural networks to abstract interpretable symbols from video data and makes quality assessments by applying rules to those symbols. We take diving as the case study. We found that domain experts prefer our system and find it more informative than purely neural approaches to AQA in diving. Our system also achieves state-of-the-art action recognition and temporal segmentation, and automatically generates a detailed report that breaks the dive down into its elements and provides objective scoring with visual evidence. As verified by a group of domain experts, this report may be used to assist judges in scoring, help train judges, and provide feedback to divers. We will open-source all of our annotated training data and code for ease of reproducibility.
☆ Bridge the Modality and Capacity Gaps in Vision-Language Model Selection
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of Pre-Trained VLMs enhances the likelihood of identifying a suitable VLM for specific tasks. Thus, a promising zero-shot image classification strategy is selecting the most appropriate Pre-Trained VLM from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection: the "Modality Gap" -- the disparity in VLM's embeddings across two different modalities, making text a less reliable substitute for images; and the "Capability Gap" -- the discrepancy between the VLM's overall ranking and its ranking for target dataset, hindering direct prediction of a model's dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first adopts optimal transport to capture the relevance between open-source datasets and target dataset with a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from open-source datasets to the target dataset for bridging those two gaps and enhancing the VLM's capacity estimation for VLM selection. Experiments across various VLMs and image classification datasets validate SWAB's effectiveness.
☆ Evaluating Frontier Models for Dangerous Capabilities
To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs. Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
☆ RewardBench: Evaluating Reward Models for Language Modeling
Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and code-base for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We created specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.
comment: 40 pages, 19 figures, 12 tables
☆ Towards an extension of Fault Trees in the Predictive Maintenance Scenario
One of the most appreciated features of Fault Trees (FTs) is their simplicity, making them fit into industrial processes. As such processes evolve in time, considering new aspects of large modern systems, modelling techniques based on FTs have adapted to these needs. This paper proposes an extension of FTs to take into account the problem of Predictive Maintenance, one of the challenges of the modern dependability field of study. The paper sketches the Predictive Fault Tree language and proposes some use cases to support their modelling and analysis in concrete industrial settings.
comment: S. Bernardi, T. Zoppi (Editors), Fast Abstracts and Student Forum Proceedings - EDCC 2024 - 19th European Dependable Computing Conference, Leuven, Belgium, 8-11 April 2024
☆ The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI
Generative AI (GAI) offers unprecedented possibilities but its commercialization has raised concerns about transparency, reproducibility, bias, and safety. Many "open-source" GAI models lack the necessary components for full understanding and reproduction, and some use restrictive licenses, a practice known as "openwashing." We propose the Model Openness Framework (MOF), a ranked classification system that rates machine learning models based on their completeness and openness, following principles of open science, open source, open data, and open access. The MOF requires specific components of the model development lifecycle to be included and released under appropriate open licenses. This framework aims to prevent misrepresentation of models claiming to be open, guide researchers and developers in providing all model components under permissive licenses, and help companies, academia, and hobbyists identify models that can be safely adopted without restrictions. Wide adoption of the MOF will foster a more open AI ecosystem, accelerating research, innovation, and adoption.
comment: 45 pages
☆ Sparse Implementation of Versatile Graph-Informed Layers
Graph Neural Networks (GNNs) have emerged as effective tools for learning tasks on graph-structured data. Recently, Graph-Informed (GI) layers were introduced to address regression tasks on graph nodes, extending their applicability beyond classic GNNs. However, existing implementations of GI layers lack efficiency due to dense memory allocation. This paper presents a sparse implementation of GI layers, leveraging the sparsity of adjacency matrices to reduce memory usage significantly. Additionally, a versatile general form of GI layers is introduced, enabling their application to subsets of graph nodes. The proposed sparse implementation improves the concrete computational efficiency and scalability of the GI layers, permitting to build deeper Graph-Informed Neural Networks (GINNs) and facilitating their scalability to larger graphs.
☆ Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models
In this paper, we propose Describe-and-Dissect (DnD), a novel method to describe the roles of hidden neurons in vision networks. DnD utilizes recent advancements in multimodal deep learning to produce complex natural language descriptions, without the need for labeled training data or a predefined set of concepts to choose from. Additionally, DnD is training-free, meaning we don't train any new models and can easily leverage more capable general purpose models in the future. We have conducted extensive qualitative and quantitative analysis to show that DnD outperforms prior work by providing higher quality neuron descriptions. Specifically, our method on average provides the highest quality labels and is more than 2 times as likely to be selected as the best explanation for a neuron than the best baseline.
☆ Towards Principled Representation Learning from Videos for Reinforcement Learning ICLR 2024
We study pre-training representations for decision-making using video data, which is abundantly available for tasks such as game agents and software testing. Even though significant empirical advances have been made on this problem, a theoretical understanding remains absent. We initiate the theoretical investigation into principled approaches for representation learning and focus on learning the latent state representations of the underlying MDP using video data. We study two types of settings: one where there is iid noise in the observation, and a more challenging setting where there is also the presence of exogenous noise, which is non-iid noise that is temporally correlated, such as the motion of people or cars in the background. We study three commonly used approaches: autoencoding, temporal contrastive learning, and forward modeling. We prove upper bounds for temporal contrastive learning and forward modeling in the presence of only iid noise. We show that these approaches can learn the latent state and use it to do efficient downstream RL with polynomial sample complexity. When exogenous noise is also present, we establish a lower bound result showing that the sample complexity of learning from video data can be exponentially worse than learning from action-labeled trajectory data. This partially explains why reinforcement learning with video pre-training is hard. We evaluate these representational learning methods in two visual domains, yielding results that are consistent with our theoretical findings.
comment: ICLR 2024 Spotlight Conference Paper
☆ Weisfeiler and Leman Go Loopy: A New Hierarchy for Graph Representational Learning ICLR 2024
We introduce $r$-loopy Weisfeiler-Leman ($r$-$\ell{}$WL), a novel hierarchy of graph isomorphism tests and a corresponding GNN framework, $r$-$\ell{}$MPNN, that can count cycles up to length $r + 2$. Most notably, we show that $r$-$\ell{}$WL can count homomorphisms of cactus graphs. This strictly extends classical 1-WL, which can only count homomorphisms of trees and, in fact, is incomparable to $k$-WL for any fixed $k$. We empirically validate the expressive and counting power of the proposed $r$-$\ell{}$MPNN on several synthetic datasets and present state-of-the-art predictive performance on various real-world datasets. The code is available at https://github.com/RPaolino/loopy
comment: Accepted at ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning
☆ An Ordering of Divergences for Variational Inference with Factorized Gaussian Approximations
Given an intractable distribution $p$, the problem of variational inference (VI) is to compute the best approximation $q$ from some more tractable family $\mathcal{Q}$. Most commonly the approximation is found by minimizing a Kullback-Leibler (KL) divergence. However, there exist other valid choices of divergences, and when $\mathcal{Q}$ does not contain~$p$, each divergence champions a different solution. We analyze how the choice of divergence affects the outcome of VI when a Gaussian with a dense covariance matrix is approximated by a Gaussian with a diagonal covariance matrix. In this setting we show that different divergences can be \textit{ordered} by the amount that their variational approximations misestimate various measures of uncertainty, such as the variance, precision, and entropy. We also derive an impossibility theorem showing that no two of these measures can be simultaneously matched by a factorized approximation; hence, the choice of divergence informs which measure, if any, is correctly estimated. Our analysis covers the KL divergence, the R\'enyi divergences, and a score-based divergence that compares $\nabla\log p$ and $\nabla\log q$. We empirically evaluate whether these orderings hold when VI is used to approximate non-Gaussian distributions.
☆ Uncertainty-Aware Explanations Through Probabilistic Self-Explainable Neural Networks
The lack of transparency of Deep Neural Networks continues to be a limitation that severely undermines their reliability and usage in high-stakes applications. Promising approaches to overcome such limitations are Prototype-Based Self-Explainable Neural Networks (PSENNs), whose predictions rely on the similarity between the input at hand and a set of prototypical representations of the output classes, offering therefore a deep, yet transparent-by-design, architecture. So far, such models have been designed by considering pointwise estimates for the prototypes, which remain fixed after the learning phase of the model. In this paper, we introduce a probabilistic reformulation of PSENNs, called Prob-PSENN, which replaces point estimates for the prototypes with probability distributions over their values. This provides not only a more flexible framework for an end-to-end learning of prototypes, but can also capture the explanatory uncertainty of the model, which is a missing feature in previous approaches. In addition, since the prototypes determine both the explanation and the prediction, Prob-PSENNs allow us to detect when the model is making uninformed or uncertain predictions, and to obtain valid explanations for them. Our experiments demonstrate that Prob-PSENNs provide more meaningful and robust explanations than their non-probabilistic counterparts, thus enhancing the explainability and reliability of the models.
☆ Reinforcement Learning for Online Testing of Autonomous Driving Systems: a Replication and Extension Study
In a recent study, Reinforcement Learning (RL) used in combination with many-objective search, has been shown to outperform alternative techniques (random search and many-objective search) for online testing of Deep Neural Network-enabled systems. The empirical evaluation of these techniques was conducted on a state-of-the-art Autonomous Driving System (ADS). This work is a replication and extension of that empirical study. Our replication shows that RL does not outperform pure random test generation in a comparison conducted under the same settings of the original study, but with no confounding factor coming from the way collisions are measured. Our extension aims at eliminating some of the possible reasons for the poor performance of RL observed in our replication: (1) the presence of reward components providing contrasting or useless feedback to the RL agent; (2) the usage of an RL algorithm (Q-learning) which requires discretization of an intrinsically continuous state space. Results show that our new RL agent is able to converge to an effective policy that outperforms random testing. Results also highlight other possible improvements, which open to further investigations on how to best leverage RL for online ADS testing.
☆ M-HOF-Opt: Multi-Objective Hierarchical Output Feedback Optimization via Multiplier Induced Loss Landscape Scheduling
When a neural network parameterized loss function consists of many terms, the combinatorial choice of weight multipliers during the optimization process forms a challenging problem. To address this, we proposed a probabilistic graphical model (PGM) for the joint model parameter and multiplier evolution process, with a hypervolume based likelihood that promotes multi-objective descent of each loss term. The corresponding parameter and multiplier estimation as a sequential decision process is then cast into an optimal control problem, where the multi-objective descent goal is dispatched hierarchically into a series of constraint optimization sub-problems. The sub-problem constraint automatically adapts itself according to Pareto dominance and serves as the setpoint for the low level multiplier controller to schedule loss landscapes via output feedback of each loss term. Our method is multiplier-free and operates at the timescale of epochs, thus saves tremendous computational resources compared to full training cycle multiplier tuning. We applied it to domain invariant variational auto-encoding with 6 loss terms on the PACS domain generalization task, and observed robust performance across a range of controller hyperparameters, as well as different multiplier initial conditions, outperforming other multiplier scheduling methods. We offered modular implementation of our method, admitting custom definition of many loss terms for applying our multi-objective hierarchical output feedback training scheme to other deep learning fields.
☆ Probabilistic Forecasting with Stochastic Interpolants and Föllmer Processes
We propose a framework for probabilistic forecasting of dynamical systems based on generative modeling. Given observations of the system state over time, we formulate the forecasting problem as sampling from the conditional distribution of the future system state given its current state. To this end, we leverage the framework of stochastic interpolants, which facilitates the construction of a generative model between an arbitrary base distribution and the target. We design a fictitious, non-physical stochastic dynamics that takes as initial condition the current system state and produces as output a sample from the target conditional distribution in finite time and without bias. This process therefore maps a point mass centered at the current state onto a probabilistic ensemble of forecasts. We prove that the drift coefficient entering the stochastic differential equation (SDE) achieving this task is non-singular, and that it can be learned efficiently by square loss regression over the time-series data. We show that the drift and the diffusion coefficients of this SDE can be adjusted after training, and that a specific choice that minimizes the impact of the estimation error gives a F\"ollmer process. We highlight the utility of our approach on several complex, high-dimensional forecasting problems, including stochastically forced Navier-Stokes and video prediction on the KTH and CLEVRER datasets.
☆ Improving the Adaptive Moment Estimation (ADAM) stochastic optimizer through an Implicit-Explicit (IMEX) time-stepping approach
The Adam optimizer, often used in Machine Learning for neural network training, corresponds to an underlying ordinary differential equation (ODE) in the limit of very small learning rates. This work shows that the classical Adam algorithm is a first order implicit-explicit (IMEX) Euler discretization of the underlying ODE. Employing the time discretization point of view, we propose new extensions of the Adam scheme obtained by using higher order IMEX methods to solve the ODE. Based on this approach, we derive a new optimization algorithm for neural network training that performs better than classical Adam on several regression and classification problems.
☆ What Matters for Active Texture Recognition With Vision-Based Tactile Sensors ICRA
This paper explores active sensing strategies that employ vision-based tactile sensors for robotic perception and classification of fabric textures. We formalize the active sampling problem in the context of tactile fabric recognition and provide an implementation of information-theoretic exploration strategies based on minimizing predictive entropy and variance of probabilistic models. Through ablation studies and human experiments, we investigate which components are crucial for quick and reliable texture recognition. Along with the active sampling strategies, we evaluate neural network architectures, representations of uncertainty, influence of data augmentation, and dataset variability. By evaluating our method on a previously published Active Clothing Perception Dataset and on a real robotic system, we establish that the choice of the active exploration strategy has only a minor influence on the recognition accuracy, whereas data augmentation and dropout rate play a significantly larger role. In a comparison study, while humans achieve 66.9% recognition accuracy, our best approach reaches 90.0% in under 5 touches, highlighting that vision-based tactile sensors are highly effective for fabric texture recognition.
comment: 7 pages, 9 figures, accepted at 2024 IEEE International Conference on Robotics and Automation (ICRA)
☆ Loss Regularizing Robotic Terrain Classification
Locomotion mechanics of legged robots are suitable when pacing through difficult terrains. Recognising terrains for such robots are important to fully yoke the versatility of their movements. Consequently, robotic terrain classification becomes significant to classify terrains in real time with high accuracy. The conventional classifiers suffer from overfitting problem, low accuracy problem, high variance problem, and not suitable for live dataset. On the other hand, classifying a growing dataset is difficult for convolution based terrain classification. Supervised recurrent models are also not practical for this classification. Further, the existing recurrent architectures are still evolving to improve accuracy of terrain classification based on live variable-length sensory data collected from legged robots. This paper proposes a new semi-supervised method for terrain classification of legged robots, avoiding preprocessing of long variable-length dataset. The proposed method has a stacked Long Short-Term Memory architecture, including a new loss regularization. The proposed method solves the existing problems and improves accuracy. Comparison with the existing architectures show the improvements.
comment: Preliminary draft of the work published in IEEE conference 2023
☆ PARAMANU-AYN: An Efficient Novel Generative and Instruction-tuned Language Model for Indian Legal Case Documents
In this paper, we present PARAMANU-AYN, a language model based exclusively on case documents of the Supreme Court of India, the Constitution of India, and the Indian Penal Code. The novel Auto Regressive (AR) decoder based model is pretrained from scratch at a context size of 8192. We evaluated our pretrained legal model on perplexity metrics. We also instruction-tuned our pretrained model on a set of 10,763 instructions covering various legal tasks such as legal reasoning, judgement explanation, legal clause generation, legal drafting, legal contract drafting, case summarization, constitutional question-answering, etc. We also evaluated the responses of prompts for instruction-tuned models by GPT-3.5-Turbo on clarity, relevance, completeness, and legal reasoning metrics in a scale of 10. Our model can be run on CPU and achieved 42.46 tokens/sec CPU inference speed. We found that our models, despite not being pretrained on legal books, various legal contracts, and legal documents, were able to learn the domain knowledge required for drafting various legal contracts and legal clauses, and generalize to draft legal contracts and legal clauses with limited instruction tuning. Hence, we conclude that for a strong domain-specialized generative language model (such as legal), very large amounts of data are not required to develop models from scratch. We believe that this work is the first attempt to make a dedicated generative legal language model from scratch for Indian Supreme Court jurisdiction or in legal NLP overall. We plan to release our Paramanu-Ayn model at https://www.bharatgpts.com.
☆ Machine Learning Optimized Approach for Parameter Selection in MESHFREE Simulations
Meshfree simulation methods are emerging as compelling alternatives to conventional mesh-based approaches, particularly in the fields of Computational Fluid Dynamics (CFD) and continuum mechanics. In this publication, we provide a comprehensive overview of our research combining Machine Learning (ML) and Fraunhofer's MESHFREE software (www.meshfree.eu), a powerful tool utilizing a numerical point cloud in a Generalized Finite Difference Method (GFDM). This tool enables the effective handling of complex flow domains, moving geometries, and free surfaces, while allowing users to finely tune local refinement and quality parameters for an optimal balance between computation time and results accuracy. However, manually determining the optimal parameter combination poses challenges, especially for less experienced users. We introduce a novel ML-optimized approach, using active learning, regression trees, and visualization on MESHFREE simulation data, demonstrating the impact of input combinations on results quality and computation time. This research contributes valuable insights into parameter optimization in meshfree simulations, enhancing accessibility and usability for a broader user base in scientific and engineering applications.
☆ Multimodal Variational Autoencoder for Low-cost Cardiac Hemodynamics Instability Detection
Recent advancements in non-invasive detection of cardiac hemodynamic instability (CHDI) primarily focus on applying machine learning techniques to a single data modality, e.g. cardiac magnetic resonance imaging (MRI). Despite their potential, these approaches often fall short especially when the size of labeled patient data is limited, a common challenge in the medical domain. Furthermore, only a few studies have explored multimodal methods to study CHDI, which mostly rely on costly modalities such as cardiac MRI and echocardiogram. In response to these limitations, we propose a novel multimodal variational autoencoder ($\text{CardioVAE}_\text{X,G}$) to integrate low-cost chest X-ray (CXR) and electrocardiogram (ECG) modalities with pre-training on a large unlabeled dataset. Specifically, $\text{CardioVAE}_\text{X,G}$ introduces a novel tri-stream pre-training strategy to learn both shared and modality-specific features, thus enabling fine-tuning with both unimodal and multimodal datasets. We pre-train $\text{CardioVAE}_\text{X,G}$ on a large, unlabeled dataset of $50,982$ subjects from a subset of MIMIC database and then fine-tune the pre-trained model on a labeled dataset of $795$ subjects from the ASPIRE registry. Comprehensive evaluations against existing methods show that $\text{CardioVAE}_\text{X,G}$ offers promising performance (AUROC $=0.79$ and Accuracy $=0.77$), representing a significant step forward in non-invasive prediction of CHDI. Our model also excels in producing fine interpretations of predictions directly associated with clinical features, thereby supporting clinical decision-making.
☆ Efficient exploration of high-Tc superconductors by a gradient-based composition design
We propose a material design method via gradient-based optimization on compositions, overcoming the limitations of traditional methods: exhaustive database searches and conditional generation models. It optimizes inputs via backpropagation, aligning the model's output closely with the target property and facilitating the discovery of unlisted materials and precise property determination. Our method is also capable of adaptive optimization under new conditions without retraining. Applying to exploring high-Tc superconductors, we identified potential compositions beyond existing databases and discovered new hydrogen superconductors via conditional optimization. This method is versatile and significantly advances material design by enabling efficient, extensive searches and adaptability to new constraints.
☆ Enhancing Law Enforcement Training: A Gamified Approach to Detecting Terrorism Financing
Tools for fighting cyber-criminal activities using new technologies are promoted and deployed every day. However, too often, they are unnecessarily complex and hard to use, requiring deep domain and technical knowledge. These characteristics often limit the engagement of law enforcement and end-users in these technologies that, despite their potential, remain misunderstood. For this reason, in this study, we describe our experience in combining learning and training methods and the potential benefits of gamification to enhance technology transfer and increase adult learning. In fact, in this case, participants are experienced practitioners in professions/industries that are exposed to terrorism financing (such as Law Enforcement Officers, Financial Investigation Officers, private investigators, etc.) We define training activities on different levels for increasing the exchange of information about new trends and criminal modus operandi among and within law enforcement agencies, intensifying cross-border cooperation and supporting efforts to combat and prevent terrorism funding activities. On the other hand, a game (hackathon) is designed to address realistic challenges related to the dark net, crypto assets, new payment systems and dark web marketplaces that could be used for terrorist activities. The entire methodology was evaluated using quizzes, contest results, and engagement metrics. In particular, training events show about 60% of participants complete the 11-week training course, while the Hackathon results, gathered in two pilot studies (Madrid and The Hague), show increasing expertise among the participants (progression in the achieved points on average). At the same time, more than 70% of participants positively evaluate the use of the gamification approach, and more than 85% of them consider the implemented Use Cases suitable for their investigations.
☆ Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?
Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off. Objectives: The aim of this study is to evaluate the Mann-Whitney U test on DP-synthetic biomedical data in terms of Type I and Type II errors, in order to establish whether statistical hypothesis testing performed on privacy preserving synthetic data is likely to lead to loss of test's validity or decreased power. Methods: We evaluate the Mann-Whitney U test on DP-synthetic data generated from real-world data, including a prostate cancer dataset (n=500) and a cardiovascular dataset (n=70 000), as well as on data drawn from two Gaussian distributions. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and MWEM, Private-PGM, and DP GAN algorithms. Conclusion: Most of the tested DP-synthetic data generation methods showed inflated Type I error, especially at privacy budget levels of $\epsilon\leq 1$. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget ($\epsilon\geq 5$) in order to have reasonable Type II error levels.
☆ CONLINE: Complex Code Generation and Refinement with Online Searching and Correctness Testing
Large Language Models (LLMs) have revolutionized code generation ability by converting natural language descriptions into executable code. However, generating complex code within real-world scenarios remains challenging due to intricate structures, subtle bugs, understanding of advanced data types, and lack of supplementary contents. To address these challenges, we introduce the CONLINE framework, which enhances code generation by incorporating planned online searches for information retrieval and automated correctness testing for iterative refinement. CONLINE also serializes the complex inputs and outputs to improve comprehension and generate test case to ensure the framework's adaptability for real-world applications. CONLINE is validated through rigorous experiments on the DS-1000 and ClassEval datasets. It shows that CONLINE substantially improves the quality of complex code generation, highlighting its potential to enhance the practicality and reliability of LLMs in generating intricate code.
☆ Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation
In this paper, we study the problem of multi-reward reinforcement learning to jointly optimize for multiple text qualities for natural language generation. We focus on the task of counselor reflection generation, where we optimize the generators to simultaneously improve the fluency, coherence, and reflection quality of generated counselor responses. We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which rely on the broad strategy of combining rewards into a single value and optimizing them simultaneously. Specifically, we employ non-contextual and contextual multi-arm bandits to dynamically adjust multiple reward weights during training. Through automatic and manual evaluations, we show that our proposed techniques, DynaOpt and C-DynaOpt, outperform existing naive and bandit baselines, showcasing their potential for enhancing language models.
☆ AdaTrans: Feature-wise and Sample-wise Adaptive Transfer Learning for High-dimensional Regression
We consider the transfer learning problem in the high dimensional setting, where the feature dimension is larger than the sample size. To learn transferable information, which may vary across features or the source samples, we propose an adaptive transfer learning method that can detect and aggregate the feature-wise (F-AdaTrans) or sample-wise (S-AdaTrans) transferable structures. We achieve this by employing a novel fused-penalty, coupled with weights that can adapt according to the transferable structure. To choose the weight, we propose a theoretically informed, data-driven procedure, enabling F-AdaTrans to selectively fuse the transferable signals with the target while filtering out non-transferable signals, and S-AdaTrans to obtain the optimal combination of information transferred from each source sample. The non-asymptotic rates are established, which recover existing near-minimax optimal rates in special cases. The effectiveness of the proposed method is validated using both synthetic and real data.
comment: Technical Report
☆ DL2Fence: Integrating Deep Learning and Frame Fusion for Enhanced Detection and Localization of Refined Denial-of-Service in Large-Scale NoCs
This study introduces a refined Flooding Injection Rate-adjustable Denial-of-Service (DoS) model for Network-on-Chips (NoCs) and more importantly presents DL2Fence, a novel framework utilizing Deep Learning (DL) and Frame Fusion (2F) for DoS detection and localization. Two Convolutional Neural Networks models for classification and segmentation were developed to detect and localize DoS respectively. It achieves detection and localization accuracies of 95.8\% and 91.7\%, and precision rates of 98.5\% and 99.3\% in a 16x16 mesh NoC. The framework's hardware overhead notably decreases by 76.3\% when scaling from 8x8 to 16x16 NoCs, and it requires 42.4\% less hardware compared to state-of-the-arts. This advancement demonstrates DL2Fence's effectiveness in balancing outstanding detection performance in large-scale NoCs with extremely low hardware overhead.
☆ Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing
Despite recent advancements in text-to-image diffusion models facilitating various image editing techniques, complex text prompts often lead to an oversight of some requests due to a bottleneck in processing text information. To tackle this challenge, we present Ground-A-Score, a simple yet powerful model-agnostic image editing method by incorporating grounding during score distillation. This approach ensures a precise reflection of intricate prompt requirements in the editing outcomes, taking into account the prior knowledge of the object locations within the image. Moreover, the selective application with a new penalty coefficient and contrastive loss helps to precisely target editing areas while preserving the integrity of the objects in the source image. Both qualitative assessments and quantitative analyses confirm that Ground-A-Score successfully adheres to the intricate details of extended and multifaceted prompts, ensuring high-quality outcomes that respect the original image attributes.
☆ Integrating Large Language Models for Severity Classification in Traffic Incident Management: A Machine Learning Approach
This study evaluates the impact of large language models on enhancing machine learning processes for managing traffic incidents. It examines the extent to which features generated by modern language models improve or match the accuracy of predictions when classifying the severity of incidents using accident reports. Multiple comparisons performed between combinations of language models and machine learning algorithms, including Gradient Boosted Decision Trees, Random Forests, and Extreme Gradient Boosting. Our research uses both conventional and language model-derived features from texts and incident reports, and their combinations to perform severity classification. Incorporating features from language models with those directly obtained from incident reports has shown to improve, or at least match, the performance of machine learning techniques in assigning severity levels to incidents, particularly when employing Random Forests and Extreme Gradient Boosting methods. This comparison was quantified using the F1-score over uniformly sampled data sets to obtain balanced severity classes. The primary contribution of this research is in the demonstration of how Large Language Models can be integrated into machine learning workflows for incident management, thereby simplifying feature extraction from unstructured text and enhancing or matching the precision of severity predictions using conventional machine learning pipeline. The engineering application of this research is illustrated through the effective use of these language processing models to refine the modelling process for incident severity classification. This work provides significant insights into the application of language processing capabilities in combination with traditional data for improving machine learning pipelines in the context of classifying incident severity.
☆ Next day fire prediction via semantic segmentation ACL
In this paper we present a deep learning pipeline for next day fire prediction. The next day fire prediction task consists in learning models that receive as input the available information for an area up until a certain day, in order to predict the occurrence of fire for the next day. Starting from our previous problem formulation as a binary classification task on instances (daily snapshots of each area) represented by tabular feature vectors, we reformulate the problem as a semantic segmentation task on images; there, each pixel corresponds to a daily snapshot of an area, while its channels represent the formerly tabular training features. We demonstrate that this problem formulation, built within a thorough pipeline achieves state of the art results.
comment: Accepted in MACLEAN@ECML/PKDD 2023
☆ What explains the success of cross-modal fine-tuning with ORCA?
ORCA (Shen et al., 2023) is a recent technique for cross-modal fine-tuning, i.e., applying pre-trained transformer models to modalities beyond their training data. The technique consists primarily of training an embedder and fine-tuning the embedder and model. Despite its high performance on a variety of downstream tasks, we do not understand precisely how each of these components contribute to ORCA's success. Therefore, we run a series of ablations and find that embedder training does not help 2D tasks at all, contrary to what the original paper posits. In 1D tasks, some amount of embedder training is necessary but more is not better. In 4 out of 6 datasets we experiment with, it is model fine-tuning that makes the biggest difference. Through our ablations and baselines, we contribute a better understanding of the individual components of ORCA.
☆ Have You Poisoned My Data? Defending Neural Networks against Data Poisoning ESORICS
The unprecedented availability of training data fueled the rapid development of powerful neural networks in recent years. However, the need for such large amounts of data leads to potential threats such as poisoning attacks: adversarial manipulations of the training data aimed at compromising the learned model to achieve a given adversarial goal. This paper investigates defenses against clean-label poisoning attacks and proposes a novel approach to detect and filter poisoned datapoints in the transfer learning setting. We define a new characteristic vector representation of datapoints and show that it effectively captures the intrinsic properties of the data distribution. Through experimental analysis, we demonstrate that effective poisons can be successfully differentiated from clean points in the characteristic vector space. We thoroughly evaluate our proposed approach and compare it to existing state-of-the-art defenses using multiple architectures, datasets, and poison budgets. Our evaluation shows that our proposal outperforms existing approaches in defense rate and final trained model performance across all experimental settings.
comment: Paper accepted for publication at European Symposium on Research in Computer Security (ESORICS) 2024
☆ REAL: Representation Enhanced Analytic Learning for Exemplar-free Class-incremental Learning
Exemplar-free class-incremental learning (EFCIL) aims to mitigate catastrophic forgetting in class-incremental learning without available historical data. Compared with its counterpart (replay-based CIL) that stores historical samples, the EFCIL suffers more from forgetting issues under the exemplar-free constraint. In this paper, inspired by the recently developed analytic learning (AL) based CIL, we propose a representation enhanced analytic learning (REAL) for EFCIL. The REAL constructs a dual-stream base pretraining (DS-BPT) and a representation enhancing distillation (RED) process to enhance the representation of the extractor. The DS-BPT pretrains model in streams of both supervised learning and self-supervised contrastive learning (SSCL) for base knowledge extraction. The RED process distills the supervised knowledge to the SSCL pretrained backbone and facilitates a subsequent AL-basd CIL that converts the CIL to a recursive least-square problem. Our method addresses the issue of insufficient discriminability in representations of unseen data caused by a frozen backbone in the existing AL-based CIL. Empirical results on various datasets including CIFAR-100, ImageNet-100 and ImageNet-1k, demonstrate that our REAL outperforms the state-of-the-arts in EFCIL, and achieves comparable or even more superior performance compared with the replay-based methods.
☆ Adversarial Attacks and Defenses in Automated Control Systems: A Comprehensive Benchmark
Integrating machine learning into Automated Control Systems (ACS) enhances decision-making in industrial process management. One of the limitations to the widespread adoption of these technologies in industry is the vulnerability of neural networks to adversarial attacks. This study explores the threats in deploying deep learning models for fault diagnosis in ACS using the Tennessee Eastman Process dataset. By evaluating three neural networks with different architectures, we subject them to six types of adversarial attacks and explore five different defense methods. Our results highlight the strong vulnerability of models to adversarial samples and the varying effectiveness of defense strategies. We also propose a novel protection approach by combining multiple defense methods and demonstrate it's efficacy. This research contributes several insights into securing machine learning within ACS, ensuring robust fault diagnosis in industrial processes.
☆ VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the necessary visual change-over-time implied in the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis based on the original single prompt leveraging LLMs, which gives accurate textual guidance to different visual states of longer videos, and 2) Temporal Attention Regularization (TAR) - a regularization technique to refine the temporal attention units of the pre-trained T2V diffusion models, which enables control over the video dynamics. We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models. We additionally analyze the temporal attention maps realized with and without VSTAR, demonstrating the importance of applying our method to mitigate neglect of the desired visual change over time.
comment: Project page: https://yumengli007.github.io/VSTAR
☆ Robustness Verifcation in Neural Networks
In this paper we investigate formal verification problems for Neural Network computations. Of central importance will be various robustness and minimization problems such as: Given symbolic specifications of allowed inputs and outputs in form of Linear Programming instances, one question is whether there do exist valid inputs such that the network computes a valid output? And does this property hold for all valid inputs? Do two given networks compute the same function? Is there a smaller network computing the same function? The complexity of these questions have been investigated recently from a practical point of view and approximated by heuristic algorithms. We complement these achievements by giving a theoretical framework that enables us to interchange security and efficiency questions in neural networks and analyze their computational complexities. We show that the problems are conquerable in a semi-linear setting, meaning that for piecewise linear activation functions and when the sum- or maximum metric is used, most of them are in P or in NP at most.
comment: 16 pages, 1 figure
☆ Detecting and Triaging Spoofing using Temporal Convolutional Networks
As algorithmic trading and electronic markets continue to transform the landscape of financial markets, detecting and deterring rogue agents to maintain a fair and efficient marketplace is crucial. The explosion of large datasets and the continually changing tricks of the trade make it difficult to adapt to new market conditions and detect bad actors. To that end, we propose a framework that can be adapted easily to various problems in the space of detecting market manipulation. Our approach entails initially employing a labelling algorithm which we use to create a training set to learn a weakly supervised model to identify potentially suspicious sequences of order book states. The main goal here is to learn a representation of the order book that can be used to easily compare future events. Subsequently, we posit the incorporation of expert assessment to scrutinize specific flagged order book states. In the event of an expert's unavailability, recourse is taken to the application of a more complex algorithm on the identified suspicious order book states. We then conduct a similarity search between any new representation of the order book against the expert labelled representations to rank the results of the weak learner. We show some preliminary results that are promising to explore further in this direction
☆ Byzantine-resilient Federated Learning With Adaptivity to Data Heterogeneity
This paper deals with federated learning (FL) in the presence of malicious Byzantine attacks and data heterogeneity. A novel Robust Average Gradient Algorithm (RAGA) is proposed, which leverages the geometric median for aggregation and can freely select the round number for local updating. Different from most existing resilient approaches, which perform convergence analysis based on strongly-convex loss function or homogeneously distributed dataset, we conduct convergence analysis for not only strongly-convex but also non-convex loss function over heterogeneous dataset. According to our theoretical analysis, as long as the fraction of dataset from malicious users is less than half, RAGA can achieve convergence at rate $\mathcal{O}({1}/{T^{2/3- \delta}})$ where $T$ is the iteration number and $\delta \in (0, 2/3)$ for non-convex loss function, and at linear rate for strongly-convex loss function. Moreover, stationary point or global optimal solution is proved to obtainable as data heterogeneity vanishes. Experimental results corroborate the robustness of RAGA to Byzantine attacks and verifies the advantage of RAGA over baselines on convergence performance under various intensity of Byzantine attacks, for heterogeneous dataset.
☆ Counting Network for Learning from Majority Label ICASSP 2024
The paper proposes a novel problem in multi-class Multiple-Instance Learning (MIL) called Learning from the Majority Label (LML). In LML, the majority class of instances in a bag is assigned as the bag's label. LML aims to classify instances using bag-level majority classes. This problem is valuable in various applications. Existing MIL methods are unsuitable for LML due to aggregating confidences, which may lead to inconsistency between the bag-level label and the label obtained by counting the number of instances for each class. This may lead to incorrect instance-level classification. We propose a counting network trained to produce the bag-level majority labels estimated by counting the number of instances for each class. This led to the consistency of the majority class between the network outputs and one obtained by counting the number of instances. Experimental results show that our counting network outperforms conventional MIL methods on four datasets The code is publicly available at https://github.com/Shiku-Kaito/Counting-Network-for-Learning-from-Majority-Label.
comment: 5 pages, 4 figures, Accepted in ICASSP 2024
☆ Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting
Automatic extraction of medical information from clinical documents poses several challenges: high costs of required clinical expertise, limited interpretability of model predictions, restricted computational resources and privacy regulations. Recent advances in domain-adaptation and prompting methods showed promising results with minimal training data using lightweight masked language models, which are suited for well-established interpretability methods. We are first to present a systematic evaluation of these methods in a low-resource setting, by performing multi-class section classification on German doctor's letters. We conduct extensive class-wise evaluations supported by Shapley values, to validate the quality of our small training data set and to ensure the interpretability of model predictions. We demonstrate that a lightweight, domain-adapted pretrained model, prompted with just 20 shots, outperforms a traditional classification model by 30.5% accuracy. Our results serve as a process-oriented guideline for clinical information extraction projects working with low-resource.
☆ Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection
Unified anomaly detection (AD) is one of the most challenges for anomaly detection, where one unified model is trained with normal samples from multiple classes with the objective to detect anomalies in these classes. For such a challenging task, popular normalizing flow (NF) based AD methods may fall into a "homogeneous mapping" issue,where the NF-based AD models are biased to generate similar latent representations for both normal and abnormal features, and thereby lead to a high missing rate of anomalies. In this paper, we propose a novel Hierarchical Gaussian mixture normalizing flow modeling method for accomplishing unified Anomaly Detection, which we call HGAD. Our HGAD consists of two key components: inter-class Gaussian mixture modeling and intra-class mixed class centers learning. Compared to the previous NF-based AD methods, the hierarchical Gaussian mixture modeling approach can bring stronger representation capability to the latent space of normalizing flows, so that even complex multi-class distribution can be well represented and learned in the latent space. In this way, we can avoid mapping different class distributions into the same single Gaussian prior, thus effectively avoiding or mitigating the "homogeneous mapping" issue. We further indicate that the more distinguishable different class centers, the more conducive to avoiding the bias issue. Thus, we further propose a mutual information maximization loss for better structuring the latent feature space. We evaluate our method on four real-world AD benchmarks, where we can significantly improve the previous NF-based AD methods and also outperform the SOTA unified AD methods.
comment: 15 pages
☆ USE: Dynamic User Modeling with Stateful Sequence Models
User embeddings play a crucial role in user engagement forecasting and personalized services. Recent advances in sequence modeling have sparked interest in learning user embeddings from behavioral data. Yet behavior-based user embedding learning faces the unique challenge of dynamic user modeling. As users continuously interact with the apps, user embeddings should be periodically updated to account for users' recent and long-term behavior patterns. Existing methods highly rely on stateless sequence models that lack memory of historical behavior. They have to either discard historical data and use only the most recent data or reprocess the old and new data jointly. Both cases incur substantial computational overhead. To address this limitation, we introduce User Stateful Embedding (USE). USE generates user embeddings and reflects users' evolving behaviors without the need for exhaustive reprocessing by storing previous model states and revisiting them in the future. Furthermore, we introduce a novel training objective named future W-behavior prediction to transcend the limitations of next-token prediction by forecasting a broader horizon of upcoming user behaviors. By combining it with the Same User Prediction, a contrastive learning-based objective that predicts whether different segments of behavior sequences belong to the same user, we further improve the embeddings' distinctiveness and representativeness. We conducted experiments on 8 downstream tasks using Snapchat users' behavioral logs in both static (i.e., fixed user behavior sequences) and dynamic (i.e., periodically updated user behavior sequences) settings. We demonstrate USE's superior performance over established baselines. The results underscore USE's effectiveness and efficiency in integrating historical and recent user behavior sequences into user embeddings in dynamic user modeling.
☆ Adaptive Ensembles of Fine-Tuned Transformers for LLM-Generated Text Detection
Large language models (LLMs) have reached human-like proficiency in generating diverse textual content, underscoring the necessity for effective fake text detection to avoid potential risks such as fake news in social media. Previous research has mostly tested single models on in-distribution datasets, limiting our understanding of how these models perform on different types of data for LLM-generated text detection task. We researched this by testing five specialized transformer-based models on both in-distribution and out-of-distribution datasets to better assess their performance and generalizability. Our results revealed that single transformer-based classifiers achieved decent performance on in-distribution dataset but limited generalization ability on out-of-distribution dataset. To improve it, we combined the individual classifiers models using adaptive ensemble algorithms, which improved the average accuracy significantly from 91.8% to 99.2% on an in-distribution test set and from 62.9% to 72.5% on an out-of-distribution test set. The results indicate the effectiveness, good generalization ability, and great potential of adaptive ensemble algorithms in LLM-generated text detection.
☆ HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling
The integration of diverse clinical modalities such as medical imaging and the tabular data obtained by the patients' Electronic Health Records (EHRs) is a crucial aspect of modern healthcare. The integrative analysis of multiple sources can provide a comprehensive understanding of a patient's condition and can enhance diagnoses and treatment decisions. Deep Neural Networks (DNNs) consistently showcase outstanding performance in a wide range of multimodal tasks in the medical domain. However, the complex endeavor of effectively merging medical imaging with clinical, demographic and genetic information represented as numerical tabular data remains a highly active and ongoing research pursuit. We present a novel framework based on hypernetworks to fuse clinical imaging and tabular data by conditioning the image processing on the EHR's values and measurements. This approach aims to leverage the complementary information present in these modalities to enhance the accuracy of various medical applications. We demonstrate the strength and the generality of our method on two different brain Magnetic Resonance Imaging (MRI) analysis tasks, namely, brain age prediction conditioned by subject's sex, and multiclass Alzheimer's Disease (AD) classification conditioned by tabular data. We show that our framework outperforms both single-modality models and state-of-the-art MRI-tabular data fusion methods. The code, enclosed to this manuscript will be made publicly available.
comment: 17 pages, 8 figures
☆ A Semantic Search Engine for Mathlib4
The interactive theorem prover, Lean, enables the verification of formal mathematical proofs and is backed by an expanding community. Central to this ecosystem is its mathematical library, mathlib4, which lays the groundwork for the formalization of an expanding range of mathematical theories. However, searching for theorems in mathlib4 can be challenging. To successfully search in mathlib4, users often need to be familiar with its naming conventions or documentation strings. Therefore, creating a semantic search engine that can be used easily by individuals with varying familiarity with mathlib4 is very important. In this paper, we present a semantic search engine for mathlib4 that accepts informal queries and finds the relevant theorems. We also establish a benchmark for assessing the performance of various search engines for mathlib4.
☆ Kernel Multigrid: Accelerate Back-fitting via Sparse Gaussian Process Regression
Additive Gaussian Processes (GPs) are popular approaches for nonparametric feature selection. The common training method for these models is Bayesian Back-fitting. However, the convergence rate of Back-fitting in training additive GPs is still an open problem. By utilizing a technique called Kernel Packets (KP), we prove that the convergence rate of Back-fitting is no faster than $(1-\mathcal{O}(\frac{1}{n}))^t$, where $n$ and $t$ denote the data size and the iteration number, respectively. Consequently, Back-fitting requires a minimum of $\mathcal{O}(n\log n)$ iterations to achieve convergence. Based on KPs, we further propose an algorithm called Kernel Multigrid (KMG). This algorithm enhances Back-fitting by incorporating a sparse Gaussian Process Regression (GPR) to process the residuals subsequent to each Back-fitting iteration. It is applicable to additive GPs with both structured and scattered data. Theoretically, we prove that KMG reduces the required iterations to $\mathcal{O}(\log n)$ while preserving the time and space complexities at $\mathcal{O}(n\log n)$ and $\mathcal{O}(n)$ per iteration, respectively. Numerically, by employing a sparse GPR with merely 10 inducing points, KMG can produce accurate approximations of high-dimensional targets within 5 iterations.
☆ Bridging scales in multiscale bubble growth dynamics with correlated fluctuations using neural operator learning
The intricate process of bubble growth dynamics involves a broad spectrum of physical phenomena from microscale mechanics of bubble formation to macroscale interplay between bubbles and surrounding thermo-hydrodynamics. Traditional bubble dynamics models including atomistic approaches and continuum-based methods segment the bubble dynamics into distinct scale-specific models. In order to bridge the gap between microscale stochastic fluid models and continuum-based fluid models for bubble dynamics, we develop a composite neural operator model to unify the analysis of nonlinear bubble dynamics across microscale and macroscale regimes by integrating a many-body dissipative particle dynamics (mDPD) model with a continuum-based Rayleigh-Plesset (RP) model through a novel neural network architecture, which consists of a deep operator network for learning the mean behavior of bubble growth subject to pressure variations and a long short-term memory network for learning the statistical features of correlated fluctuations in microscale bubble dynamics. Training and testing data are generated by conducting mDPD and RP simulations for nonlinear bubble dynamics with initial bubble radii ranging from 0.1 to 1.5 micrometers. Results show that the trained composite neural operator model can accurately predict bubble dynamics across scales, with a 99% accuracy for the time evaluation of the bubble radius under varying external pressure while containing correct size-dependent stochastic fluctuations in microscale bubble growth dynamics. The composite neural operator is the first deep learning surrogate for multiscale bubble growth dynamics that can capture correct stochastic fluctuations in microscopic fluid phenomena, which sets a new direction for future research in multiscale fluid dynamics modeling.
comment: 19 pages, 9 figures
☆ Rotary Position Embedding for Vision Transformer
Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines to apply RoPE into ViT, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit
comment: 20 pages, 5 figures
☆ Building Optimal Neural Architectures using Interpretable Knowledge CVPR'24
Neural Architecture Search is a costly practice. The fact that a search space can span a vast number of design choices with each architecture evaluation taking nontrivial overhead makes it hard for an algorithm to sufficiently explore candidate networks. In this paper, we propose AutoBuild, a scheme which learns to align the latent embeddings of operations and architecture modules with the ground-truth performance of the architectures they appear in. By doing so, AutoBuild is capable of assigning interpretable importance scores to architecture modules, such as individual operation features and larger macro operation sequences such that high-performance neural networks can be constructed without any need for search. Through experiments performed on state-of-the-art image classification, segmentation, and Stable Diffusion models, we show that by mining a relatively small set of evaluated architectures, AutoBuild can learn to build high-quality architectures directly or help to reduce search space to focus on relevant areas, finding better architectures that outperform both the original labeled ones and ones found by search baselines. Code available at https://github.com/Ascend-Research/AutoBuild
comment: CVPR'24; 18 Pages, 18 Figures, 3 Tables
☆ A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs
Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing in graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses in attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m- dimensional random walk that accounts for the paths specified in a hypothesis. We further optimize its time efficiency and propose PHASEopt. Experiments on real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling in terms of accuracy and time efficiency.
☆ AFLoRA: Adaptive Freezing of Low Rank Adaptation in Parameter Efficient Fine-Tuning of Large Models
We present a novel Parameter-Efficient Fine-Tuning (PEFT) method, dubbed as Adaptive Freezing of Low Rank Adaptation (AFLoRA). Specifically, for each pre-trained frozen weight tensor, we add a parallel path of trainable low-rank matrices, namely a down-projection and an up-projection matrix, each of which is followed by a feature transformation vector. Based on a novel freezing score, we the incrementally freeze these projection matrices during fine-tuning to reduce the computation and alleviate over-fitting. Our experimental results demonstrate that we can achieve state-of-the-art performance with an average improvement of up to $0.85\%$ as evaluated on GLUE benchmark while yeilding up to $9.5\times$ fewer average trainable parameters. While compared in terms of runtime, AFLoRA can yield up to $1.86\times$ improvement as opposed to similar PEFT alternatives. Besides the practical utility of our approach, we provide insights on the trainability requirements of LoRA paths at different modules and the freezing schedule for the different projection matrices. Code will be released.
comment: 5 pages, 5 figures
☆ Unifews: Unified Entry-Wise Sparsification for Efficient Graph Neural Network
Graph Neural Networks (GNNs) have shown promising performance in various graph learning tasks, but at the cost of resource-intensive computations. The primary overhead of GNN update stems from graph propagation and weight transformation, both involving operations on graph-scale matrices. Previous studies attempt to reduce the computational budget by leveraging graph-level or network-level sparsification techniques, resulting in downsized graph or weights. In this work, we propose Unifews, which unifies the two operations in an entry-wise manner considering individual matrix elements, and conducts joint edge-weight sparsification to enhance learning efficiency. The entry-wise design of Unifews enables adaptive compression across GNN layers with progressively increased sparsity, and is applicable to a variety of architectural designs with on-the-fly operation simplification. Theoretically, we establish a novel framework to characterize sparsified GNN learning in view of a graph optimization process, and prove that Unifews effectively approximates the learning objective with bounded error and reduced computational load. We conduct extensive experiments to evaluate the performance of our method in diverse settings. Unifews is advantageous in jointly removing more than 90% of edges and weight entries with comparable or better accuracy than baseline models. The sparsification offers remarkable efficiency improvements including 10-20x matrix operation reduction and up to 100x acceleration in graph propagation time for the largest graph at the billion-edge scale.
☆ Arcee's MergeKit: A Toolkit for Merging Large Language Models
The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pre-trained models for specific tasks, has resulted in the development of vast amounts of task-specific models, typically specialized in individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI - including the difficulties of catastrophic forgetting and multi-task learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the worlds most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.
comment: 11 pages, 4 figures
☆ A Unified and General Framework for Continual Learning ICLR 2024
Continual Learning (CL) focuses on learning from dynamic and changing data distributions while retaining previously acquired knowledge. Various methods have been developed to address the challenge of catastrophic forgetting, including regularization-based, Bayesian-based, and memory-replay-based techniques. However, these methods lack a unified framework and common terminology for describing their approaches. This research aims to bridge this gap by introducing a comprehensive and overarching framework that encompasses and reconciles these existing methodologies. Notably, this new framework is capable of encompassing established CL approaches as special instances within a unified and general optimization objective. An intriguing finding is that despite their diverse origins, these methods share common mathematical structures. This observation highlights the compatibility of these seemingly distinct techniques, revealing their interconnectedness through a shared underlying optimization objective. Moreover, the proposed general framework introduces an innovative concept called refresh learning, specifically designed to enhance the CL performance. This novel approach draws inspiration from neuroscience, where the human brain often sheds outdated information to improve the retention of crucial knowledge and facilitate the acquisition of new information. In essence, refresh learning operates by initially unlearning current data and subsequently relearning it. It serves as a versatile plug-in that seamlessly integrates with existing CL methods, offering an adaptable and effective enhancement to the learning process. Extensive experiments on CL benchmarks and theoretical analysis demonstrate the effectiveness of the proposed refresh learning. Code is available at \url{https://github.com/joey-wang123/CL-refresh-learning}.
comment: ICLR 2024
☆ Decentralized Federated Learning: Model Update Tracking Under Imperfect Information Sharing
A novel Decentralized Noisy Model Update Tracking Federated Learning algorithm (FedNMUT) is proposed, which is tailored to function efficiently in the presence of noisy communication channels that reflect imperfect information exchange. This algorithm uses gradient tracking to minimize the impact of data heterogeneity while minimizing communication overhead. The proposed algorithm incorporates noise into its parameters to mimic the conditions of noisy communication channels, thereby enabling consensus among clients through a communication graph topology in such challenging environments. FedNMUT prioritizes parameter sharing and noise incorporation to increase the resilience of decentralized learning systems against noisy communications. Through theoretical and empirical validation, it is demonstrated that the performance of FedNMUT is superior compared to the existing state-of-the-art methods and conventional parameter-mixing approaches in dealing with imperfect information sharing. This proves the capability of the proposed algorithm to counteract the negative effects of communication noise in a decentralized learning framework.
comment: arXiv admin note: substantial text overlap with arXiv:2303.10695
☆ Divide-Conquer Transformer Learning for Predicting Electric Vehicle Charging Events Using Smart Meter Data
Predicting electric vehicle (EV) charging events is crucial for load scheduling and energy management, promoting seamless transportation electrification and decarbonization. While prior studies have focused on EV charging demand prediction, primarily for public charging stations using historical charging data, home charging prediction is equally essential. However, existing prediction methods may not be suitable due to the unavailability of or limited access to home charging data. To address this research gap, inspired by the concept of non-intrusive load monitoring (NILM), we develop a home charging prediction method using historical smart meter data. Different from NILM detecting EV charging that has already occurred, our method provides predictive information of future EV charging occurrences, thus enhancing its utility for charging management. Specifically, our method, leverages a self-attention mechanism-based transformer model, employing a ``divide-conquer'' strategy, to process historical meter data to effectively and learn EV charging representation for charging occurrence prediction. Our method enables prediction at one-minute interval hour-ahead. Experimental results demonstrate the effectiveness of our method, achieving consistently high accuracy of over 96.81\% across different prediction time spans. Notably, our method achieves high prediction performance solely using smart meter data, making it a practical and suitable solution for grid operators.
comment: 2024 IEEE Power & Energy Society General Meeting (PESGM)
☆ Federated reinforcement learning for robot motion planning with zero-shot generalization
This paper considers the problem of learning a control policy for robot motion planning with zero-shot generalization, i.e., no data collection and policy adaptation is needed when the learned policy is deployed in new environments. We develop a federated reinforcement learning framework that enables collaborative learning of multiple learners and a central server, i.e., the Cloud, without sharing their raw data. In each iteration, each learner uploads its local control policy and the corresponding estimated normalized arrival time to the Cloud, which then computes the global optimum among the learners and broadcasts the optimal policy to the learners. Each learner then selects between its local control policy and that from the Cloud for next iteration. The proposed framework leverages on the derived zero-shot generalization guarantees on arrival time and safety. Theoretical guarantees on almost-sure convergence, almost consensus, Pareto improvement and optimality gap are also provided. Monte Carlo simulation is conducted to evaluate the proposed framework.
☆ A Comparative Study of Machine Learning Models Predicting Energetics of Interacting Defects
Interacting defect systems are ubiquitous in materials under realistic scenarios, yet gaining an atomic-level understanding of these systems from a computational perspective is challenging - it often demands substantial resources due to the necessity of employing supercell calculations. While machine learning techniques have shown potential in accelerating materials simulations, their application to systems involving interacting defects remains relatively rare. In this work, we present a comparative study of three different methods to predict the free energy change of systems with interacting defects. We leveraging a limited dataset from Density Functional Theory(DFT) calculations to assess the performance models using materials descriptors, graph neural networks and cluster expansion. Our findings indicate that the cluster expansion model can achieve precise energetics predictions even with this limited dataset. Furthermore, with synthetic data generate from cluster expansion model at near-DFT levels, we obtained enlarged dataset to assess the demands on data for training accurate prediction models using graph neural networks for systems featuring interacting defects. A brief discussion of the computational cost for each method is provided at the end. This research provide a preliminary evaluation of applying machine learning techniques in imperfect surface systems.
☆ Tackling Noisy Labels with Network Parameter Additive Decomposition
Given data with noisy labels, over-parameterized deep networks suffer overfitting mislabeled data, resulting in poor generalization. The memorization effect of deep networks shows that although the networks have the ability to memorize all noisy data, they would first memorize clean training data, and then gradually memorize mislabeled training data. A simple and effective method that exploits the memorization effect to combat noisy labels is early stopping. However, early stopping cannot distinguish the memorization of clean data and mislabeled data, resulting in the network still inevitably overfitting mislabeled data in the early training stage.In this paper, to decouple the memorization of clean data and mislabeled data, and further reduce the side effect of mislabeled data, we perform additive decomposition on network parameters. Namely, all parameters are additively decomposed into two groups, i.e., parameters $\mathbf{w}$ are decomposed as $\mathbf{w}=\bm{\sigma}+\bm{\gamma}$. Afterward, the parameters $\bm{\sigma}$ are considered to memorize clean data, while the parameters $\bm{\gamma}$ are considered to memorize mislabeled data. Benefiting from the memorization effect, the updates of the parameters $\bm{\sigma}$ are encouraged to fully memorize clean data in early training, and then discouraged with the increase of training epochs to reduce interference of mislabeled data. The updates of the parameters $\bm{\gamma}$ are the opposite. In testing, only the parameters $\bm{\sigma}$ are employed to enhance generalization. Extensive experiments on both simulated and real-world benchmarks confirm the superior performance of our method.
comment: Accepted by IEEE T-PAMI
☆ Diffusion Model for Data-Driven Black-Box Optimization
Generative AI has redefined artificial intelligence, enabling the creation of innovative content and customized solutions that drive business practices into a new era of efficiency and creativity. In this paper, we focus on diffusion models, a powerful generative AI technology, and investigate their potential for black-box optimization over complex structured variables. Consider the practical scenario where one wants to optimize some structured design in a high-dimensional space, based on massive unlabeled data (representing design variables) and a small labeled dataset. We study two practical types of labels: 1) noisy measurements of a real-valued reward function and 2) human preference based on pairwise comparisons. The goal is to generate new designs that are near-optimal and preserve the designed latent structures. Our proposed method reformulates the design optimization problem into a conditional sampling problem, which allows us to leverage the power of diffusion models for modeling complex distributions. In particular, we propose a reward-directed conditional diffusion model, to be trained on the mixed data, for sampling a near-optimal solution conditioned on high predicted rewards. Theoretically, we establish sub-optimality error bounds for the generated designs. The sub-optimality gap nearly matches the optimal guarantee in off-policy bandits, demonstrating the efficiency of reward-directed diffusion models for black-box optimization. Moreover, when the data admits a low-dimensional latent subspace structure, our model efficiently generates high-fidelity designs that closely respect the latent structure. We provide empirical experiments validating our model in decision-making and content-creation tasks.
comment: arXiv admin note: substantial text overlap with arXiv:2307.07055
☆ Nellie: Automated organelle segmentation, tracking, and hierarchical feature extraction in 2D/3D live-cell microscopy
The analysis of dynamic organelles remains a formidable challenge, though key to understanding biological processes. We introduce Nellie, an automated and unbiased pipeline for segmentation, tracking, and feature extraction of diverse intracellular structures. Nellie adapts to image metadata, eliminating user input. Nellie's preprocessing pipeline enhances structural contrast on multiple intracellular scales allowing for robust hierarchical segmentation of sub-organellar regions. Internal motion capture markers are generated and tracked via a radius-adaptive pattern matching scheme, and used as guides for sub-voxel flow interpolation. Nellie extracts a plethora of features at multiple hierarchical levels for deep and customizable analysis. Nellie features a Napari-based GUI that allows for code-free operation and visualization, while its modular open-source codebase invites customization by experienced users. We demonstrate Nellie's wide variety of use cases with two examples: unmixing multiple organelles from a single channel using feature-based classification and training an unsupervised graph autoencoder on mitochondrial multi-mesh graphs to quantify latent space embedding changes following ionomycin treatment.
comment: for associated code, see https://github.com/aelefebv/nellie; 82 pages, 5 main figures, 11 extended figures
☆ From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards ACL 2024
Recent progress in large language models (LLMs) has led to their widespread adoption in various domains. However, these advancements have also introduced additional safety risks and raised concerns regarding their detrimental impact on already marginalized populations. Despite growing mitigation efforts to develop safety safeguards, such as supervised safety-oriented fine-tuning and leveraging safe reinforcement learning from human feedback, multiple concerns regarding the safety and ingrained biases in these models remain. Furthermore, previous work has demonstrated that models optimized for safety often display exaggerated safety behaviors, such as a tendency to refrain from responding to certain requests as a precautionary measure. As such, a clear trade-off between the helpfulness and safety of these models has been documented in the literature. In this paper, we further investigate the effectiveness of safety measures by evaluating models on already mitigated biases. Using the case of Llama 2 as an example, we illustrate how LLMs' safety responses can still encode harmful assumptions. To do so, we create a set of non-toxic prompts, which we then use to evaluate Llama models. Through our new taxonomy of LLMs responses to users, we observe that the safety/helpfulness trade-offs are more pronounced for certain demographic groups which can lead to quality-of-service harms for marginalized populations.
comment: 9 pages, 4 figures, submitted to the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
♻ ☆ The Expressive Power of Transformers with Chain of Thought ICLR
Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming a slight generalization to standard pre-norm, adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, our results provide a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
comment: 9-page preprint. Updated March 20 after ICLR acceptance
♻ ☆ Ada-NAV: Adaptive Trajectory Length-Based Sample Efficient Policy Learning for Robotic Navigation
Trajectory length stands as a crucial hyperparameter within reinforcement learning (RL) algorithms, significantly contributing to the sample inefficiency in robotics applications. Motivated by the pivotal role trajectory length plays in the training process, we introduce Ada-NAV, a novel adaptive trajectory length scheme designed to enhance the training sample efficiency of RL algorithms in robotic navigation tasks. Unlike traditional approaches that treat trajectory length as a fixed hyperparameter, we propose to dynamically adjust it based on the entropy of the underlying navigation policy. Interestingly, Ada-NAV can be applied to both existing on-policy and off-policy RL methods, which we demonstrate by empirically validating its efficacy on three popular RL methods: REINFORCE, Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC). We demonstrate through simulated and real-world robotic experiments that Ada-NAV outperforms conventional methods that employ constant or randomly sampled trajectory lengths. Specifically, for a fixed sample budget, Ada-NAV achieves an 18\% increase in navigation success rate, a 20-38\% reduction in navigation path length, and a 9.32\% decrease in elevation costs. Furthermore, we showcase the versatility of Ada-NAV by integrating it with the Clearpath Husky robot, illustrating its applicability in complex outdoor environments.
comment: 11 pages, 9 figures, 2 tables
♻ ☆ Universal consistency of the $k$-NN rule in metric spaces and Nagata dimension. II
We continue to investigate the $k$ nearest neighbour ($k$-NN) learning rule in complete separable metric spaces. Thanks to the results of C\'erou and Guyader (2006) and Preiss (1983), this rule is known to be universally consistent in every such metric space that is sigma-finite dimensional in the sense of Nagata. Here we show that the rule is strongly universally consistent in such spaces in the absence of ties. Under the tie-breaking strategy applied by Devroye, Gy\"{o}rfi, Krzy\.{z}ak, and Lugosi (1994) in the Euclidean setting, we manage to show the strong universal consistency in non-Archimedian metric spaces (that is, those of Nagata dimension zero). Combining the theorem of C\'erou and Guyader with results of Assouad and Quentin de Gromard (2006), one deduces that the $k$-NN rule is universally consistent in metric spaces having finite dimension in the sense of de Groot. In particular, the $k$-NN rule is universally consistent in the Heisenberg group which is not sigma-finite dimensional in the sense of Nagata as follows from an example independently constructed by Kor\'anyi and Reimann (1995) and Sawyer and Wheeden (1992).
comment: Latex 2e, 27 pages, 1 figure. Minor revisions to conform with the last set of journal page proofs: two typos corrected, the bibliography rearranged in the order of citations (the ESAIM:PS home style), and two articles that were no longer cited removed
♻ ☆ Having Beer after Prayer? Measuring Cultural Bias in Large Language Models
As the reach of large language models (LMs) expands globally, their ability to cater to diverse cultural contexts becomes crucial. Despite advancements in multilingual capabilities, models are not designed with appropriate cultural nuances. In this paper, we show that multilingual and Arabic monolingual LMs exhibit bias towards entities associated with Western culture. We introduce CAMeL, a novel resource of 628 naturally-occurring prompts and 20,368 entities spanning eight types that contrast Arab and Western cultures. CAMeL provides a foundation for measuring cultural biases in LMs through both extrinsic and intrinsic evaluations. Using CAMeL, we examine the cross-cultural performance in Arabic of 16 different LMs on tasks such as story generation, NER, and sentiment analysis, where we find concerning cases of stereotyping and cultural unfairness. We further test their text-infilling performance, revealing the incapability of appropriate adaptation to Arab cultural contexts. Finally, we analyze 6 Arabic pre-training corpora and find that commonly used sources such as Wikipedia may not be best suited to build culturally aware LMs, if used as they are without adjustment. We will make CAMeL publicly available at: https://github.com/tareknaous/camel
♻ ☆ The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms
We investigate a Bayesian $k$-armed bandit problem in the \emph{many-armed} regime, where $k \geq \sqrt{T}$ and $T$ represents the time horizon. Initially, and aligned with recent literature on many-armed bandit problems, we observe that subsampling plays a key role in designing optimal algorithms; the conventional UCB algorithm is sub-optimal, whereas a subsampled UCB (SS-UCB), which selects $\Theta(\sqrt{T})$ arms for execution under the UCB framework, achieves rate-optimality. However, despite SS-UCB's theoretical promise of optimal regret, it empirically underperforms compared to a greedy algorithm that consistently chooses the empirically best arm. This observation extends to contextual settings through simulations with real-world data. Our findings suggest a new form of \emph{free exploration} beneficial to greedy algorithms in the many-armed context, fundamentally linked to a tail event concerning the prior distribution of arm rewards. This finding diverges from the notion of free exploration, which relates to covariate variation, as recently discussed in contextual bandit literature. Expanding upon these insights, we establish that the subsampled greedy approach not only achieves rate-optimality for Bernoulli bandits within the many-armed regime but also attains sublinear regret across broader distributions. Collectively, our research indicates that in the many-armed regime, practitioners might find greater value in adopting greedy algorithms.
♻ ☆ Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels NeurIPS 2023
Intersection over Union (IoU) losses are surrogates that directly optimize the Jaccard index. Leveraging IoU losses as part of the loss function have demonstrated superior performance in semantic segmentation tasks compared to optimizing pixel-wise losses such as the cross-entropy loss alone. However, we identify a lack of flexibility in these losses to support vital training techniques like label smoothing, knowledge distillation, and semi-supervised learning, mainly due to their inability to process soft labels. To address this, we introduce Jaccard Metric Losses (JMLs), which are identical to the soft Jaccard loss in standard settings with hard labels but are fully compatible with soft labels. We apply JMLs to three prominent use cases of soft labels: label smoothing, knowledge distillation and semi-supervised learning, and demonstrate their potential to enhance model accuracy and calibration. Our experiments show consistent improvements over the cross-entropy loss across 4 semantic segmentation datasets (Cityscapes, PASCAL VOC, ADE20K, DeepGlobe Land) and 13 architectures, including classic CNNs and recent vision transformers. Remarkably, our straightforward approach significantly outperforms state-of-the-art knowledge distillation and semi-supervised learning methods. The code is available at \href{https://github.com/zifuwanggg/JDTLosses}{https://github.com/zifuwanggg/JDTLosses}.
comment: NeurIPS 2023
♻ ☆ Roto-translated Local Coordinate Frames For Interacting Dynamical Systems NeurIPS 2021
Modelling interactions is critical in learning complex dynamical systems, namely systems of interacting objects with highly non-linear and time-dependent behaviour. A large class of such systems can be formalized as $\textit{geometric graphs}$, $\textit{i.e.}$, graphs with nodes positioned in the Euclidean space given an $\textit{arbitrarily}$ chosen global coordinate system, for instance vehicles in a traffic scene. Notwithstanding the arbitrary global coordinate system, the governing dynamics of the respective dynamical systems are invariant to rotations and translations, also known as $\textit{Galilean invariance}$. As ignoring these invariances leads to worse generalization, in this work we propose local coordinate frames per node-object to induce roto-translation invariance to the geometric graph of the interacting dynamical system. Further, the local coordinate frames allow for a natural definition of anisotropic filtering in graph neural networks. Experiments in traffic scenes, 3D motion capture, and colliding particles demonstrate that the proposed approach comfortably outperforms the recent state-of-the-art.
comment: In NeurIPS 2021. Source code: https://github.com/mkofinas/locs
♻ ☆ Deep Reinforcement Learning: A Convex Optimization Approach
In this paper, we consider reinforcement learning of nonlinear systems with continuous state and action spaces. We present an episodic learning algorithm, where we for each episode use convex optimization to find a two-layer neural network approximation of the optimal $Q$-function. The convex optimization approach guarantees that the weights calculated at each episode are optimal, with respect to the given sampled states and actions of the current episode. For stable nonlinear systems, we show that the algorithm converges and that the converging parameters of the trained neural network can be made arbitrarily close to the optimal neural network parameters. In particular, if the regularization parameter is $\rho$ and the time horizon is $T$, then the parameters of the trained neural network converge to $w$, where the distance between $w$ from the optimal parameters $w^\star$ is bounded by $\mathcal{O}(\rho T^{-1})$. That is, when the number of episodes goes to infinity, there exists a constant $C$ such that \[\|w-w^\star\| \le C\cdot\frac{\rho}{T}.\] In particular, our algorithm converges arbitrarily close to the optimal neural network parameters as the time horizon increases or as the regularization parameter decreases.
♻ ☆ MCRAGE: Synthetic Healthcare Data for Fairness
In the field of healthcare, electronic health records (EHR) serve as crucial training data for developing machine learning models for diagnosis, treatment, and the management of healthcare resources. However, medical datasets are often imbalanced in terms of sensitive attributes such as race/ethnicity, gender, and age. Machine learning models trained on class-imbalanced EHR datasets perform significantly worse in deployment for individuals of the minority classes compared to those from majority classes, which may lead to inequitable healthcare outcomes for minority groups. To address this challenge, we propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE), a novel approach to augment imbalanced datasets using samples generated by a deep generative model. The MCRAGE process involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes. We use this synthetic data to augment the existing imbalanced dataset, resulting in a more balanced distribution across all classes, which can be used to train less biased downstream models. We measure the performance of MCRAGE versus alternative approaches using Accuracy, F1 score and AUROC of these downstream models. We provide theoretical justification for our method in terms of recent convergence results for DDPMs.
comment: Keywords: synthetic electronic health records, conditional denoising diffusion probabilistic model, healthcare AI, tabular data, fairness, synthetic data. This paper is the result of work completed at the 2023 Emory University Department of Mathematics REU/RET program under the direction of Project Advisor Dr. Xi Yuanzhe. This work is sponsored by NSF DMS 2051019
♻ ☆ Normalizing flow-based deep variational Bayesian network for seismic multi-hazards and impacts estimation from InSAR imagery
Onsite disasters like earthquakes can trigger cascading hazards and impacts, such as landslides and infrastructure damage, leading to catastrophic losses; thus, rapid and accurate estimates are crucial for timely and effective post-disaster responses. Interferometric Synthetic aperture radar (InSAR) data is important in providing high-resolution onsite information for rapid hazard estimation. Most recent methods using InSAR imagery signals predict a single type of hazard and thus often suffer low accuracy due to noisy and complex signals induced by co-located hazards, impacts, and irrelevant environmental changes (e.g., vegetation changes, human activities). We introduce a novel stochastic variational inference with normalizing flows derived to jointly approximate posteriors of multiple unobserved hazards and impacts from noisy InSAR imagery.
comment: This paper needs to be reviewed by the USGS
♻ ☆ Graph Neural Networks for Learning Equivariant Representations of Neural Networks ICLR 2024
Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods. The source code is open-sourced at https://github.com/mkofinas/neural-graphs.
comment: In ICLR 2024. Source code: https://github.com/mkofinas/neural-graphs
♻ ☆ Latent Field Discovery In Interacting Dynamical Systems With Neural Fields NeurIPS 2023
Systems of interacting objects often evolve under the influence of field effects that govern their dynamics, yet previous works have abstracted away from such effects, and assume that systems evolve in a vacuum. In this work, we focus on discovering these fields, and infer them from the observed dynamics alone, without directly observing them. We theorize the presence of latent force fields, and propose neural fields to learn them. Since the observed dynamics constitute the net effect of local object interactions and global field effects, recently popularized equivariant networks are inapplicable, as they fail to capture global information. To address this, we propose to disentangle local object interactions -- which are $\mathrm{SE}(n)$ equivariant and depend on relative states -- from external global field effects -- which depend on absolute states. We model interactions with equivariant graph networks, and combine them with neural fields in a novel graph network that integrates field forces. Our experiments show that we can accurately discover the underlying fields in charged particles settings, traffic scenes, and gravitational n-body problems, and effectively use them to learn the system and forecast future trajectories.
comment: In NeurIPS 2023. Source code: https://github.com/mkofinas/aether
♻ ☆ Dice Semimetric Losses: Optimizing the Dice Score with Soft Labels MICCAI 2023
The soft Dice loss (SDL) has taken a pivotal role in numerous automated segmentation pipelines in the medical imaging community. Over the last years, some reasons behind its superior functioning have been uncovered and further optimizations have been explored. However, there is currently no implementation that supports its direct utilization in scenarios involving soft labels. Hence, a synergy between the use of SDL and research leveraging the use of soft labels, also in the context of model calibration, is still missing. In this work, we introduce Dice semimetric losses (DMLs), which (i) are by design identical to SDL in a standard setting with hard labels, but (ii) can be employed in settings with soft labels. Our experiments on the public QUBIQ, LiTS and KiTS benchmarks confirm the potential synergy of DMLs with soft labels (e.g. averaging, label smoothing, and knowledge distillation) over hard labels (e.g. majority voting and random selection). As a result, we obtain superior Dice scores and model calibration, which supports the wider adoption of DMLs in practice. The code is available at https://github.com/zifuwanggg/JDTLosses
comment: MICCAI 2023
♻ ☆ Weight-Inherited Distillation for Task-Agnostic BERT Compression NAACL2024
Knowledge Distillation (KD) is a predominant approach for BERT compression. Previous KD-based methods focus on designing extra alignment losses for the student model to mimic the behavior of the teacher model. These methods transfer the knowledge in an indirect way. In this paper, we propose a novel Weight-Inherited Distillation (WID), which directly transfers knowledge from the teacher. WID does not require any additional alignment loss and trains a compact student by inheriting the weights, showing a new perspective of knowledge distillation. Specifically, we design the row compactors and column compactors as mappings and then compress the weights via structural re-parameterization. Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms previous state-of-the-art KD-based baselines. Further analysis indicates that WID can also learn the attention patterns from the teacher model without any alignment loss on attention distributions. The code is available at https://github.com/wutaiqiang/WID-NAACL2024.
comment: 9 pages, 4 figures, NAACL2024 findings
♻ ☆ S$Ω$I: Score-based O-INFORMATION Estimation
The analysis of scientific data and complex multivariate systems requires information quantities that capture relationships among multiple random variables. Recently, new information-theoretic measures have been developed to overcome the shortcomings of classical ones, such as mutual information, that are restricted to considering pairwise interactions. Among them, the concept of information synergy and redundancy is crucial for understanding the high-order dependencies between variables. One of the most prominent and versatile measures based on this concept is O-information, which provides a clear and scalable way to quantify the synergy-redundancy balance in multivariate systems. However, its practical application is limited to simplified cases. In this work, we introduce S$\Omega$I, which allows for the first time to compute O-information without restrictive assumptions about the system. Our experiments validate our approach on synthetic data, and demonstrate the effectiveness of S$\Omega$I in the context of a real-world use case.
♻ ☆ Guaranteeing Control Requirements via Reward Shaping in Reinforcement Learning
In addressing control problems such as regulation and tracking through reinforcement learning, it is often required to guarantee that the acquired policy meets essential performance and stability criteria such as a desired settling time and steady-state error prior to deployment. Motivated by this necessity, we present a set of results and a systematic reward shaping procedure that (i) ensures the optimal policy generates trajectories that align with specified control requirements and (ii) allows to assess whether any given policy satisfies them. We validate our approach through comprehensive numerical experiments conducted in two representative environments from OpenAI Gym: the Inverted Pendulum swing-up problem and the Lunar Lander. Utilizing both tabular and deep reinforcement learning methods, our experiments consistently affirm the efficacy of our proposed framework, highlighting its effectiveness in ensuring policy adherence to the prescribed control requirements.
♻ ☆ The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection ICASSP 2024
As machine learning tasks continue to evolve, the trend has been to gather larger datasets and train increasingly larger models. While this has led to advancements in accuracy, it has also escalated computational costs to unsustainable levels. Addressing this, our work aims to strike a delicate balance between computational efficiency and model accuracy, a persisting challenge in the field. We introduce a novel method that employs core subset selection for reweighting, effectively optimizing both computational time and model performance. By focusing on a strategically selected coreset, our approach offers a robust representation, as it efficiently minimizes the influence of outliers. The re-calibrated weights are then mapped back to and propagated across the entire dataset. Our experimental results substantiate the effectiveness of this approach, underscoring its potential as a scalable and precise solution for model training.
comment: Accepted to ICASSP 2024
♻ ☆ Observational and Experimental Insights into Machine Learning-Based Defect Classification in Wafers
This survey paper offers a comprehensive review of methodologies utilizing machine learning (ML) classification techniques for identifying wafer defects in semiconductor manufacturing. Despite the growing body of research demonstrating the effectiveness of ML in wafer defect identification, there is a noticeable absence of comprehensive reviews on this subject. This survey attempts to fill this void by amalgamating available literature and providing an in-depth analysis of the advantages, limitations, and potential applications of various ML classification algorithms in the realm of wafer defect detection. An innovative taxonomy of methodologies that we present provides a detailed classification of algorithms into more refined categories and techniques. This taxonomy follows a three-tier structure, starting from broad methodology categories and ending with specific techniques. It aids researchers in comprehending the complex relationships between different algorithms and their techniques. We employ a rigorous Observational and experimental evaluation to rank these varying techniques. For the Observational evaluation, we assess techniques based on a set of four criteria. The experimental evaluation ranks the algorithms employing the same techniques, sub-categories, and categories. Also the paper illuminates the future prospects of ML classification techniques for wafer defect identification, underscoring potential advancements and opportunities for further research in this field
♻ ☆ Interpretable Meta-Learning of Physical Systems
Machine learning methods can be a valuable aid in the scientific process, but they need to face challenging settings where data come from inhomogeneous experimental conditions. Recent meta-learning methods have made significant progress in multi-task learning, but they rely on black-box neural networks, resulting in high computational costs and limited interpretability. Leveraging the structure of the learning problem, we argue that multi-environment generalization can be achieved using a simpler learning model, with an affine structure with respect to the learning task. Crucially, we prove that this architecture can identify the physical parameters of the system, enabling interpreable learning. We demonstrate the competitive generalization performance and the low computational cost of our method by comparing it to state-of-the-art algorithms on physical systems, ranging from toy models to complex, non-analytical systems. The interpretability of our method is illustrated with original applications to physical-parameter-induced adaptation and to adaptive control.
♻ ☆ Bounce: Reliable High-Dimensional Bayesian Optimization for Combinatorial and Mixed Spaces
Impactful applications such as materials discovery, hardware design, neural architecture search, or portfolio optimization require optimizing high-dimensional black-box functions with mixed and combinatorial input spaces. While Bayesian optimization has recently made significant progress in solving such problems, an in-depth analysis reveals that the current state-of-the-art methods are not reliable. Their performances degrade substantially when the unknown optima of the function do not have a certain structure. To fill the need for a reliable algorithm for combinatorial and mixed spaces, this paper proposes Bounce that relies on a novel map of various variable types into nested embeddings of increasing dimensionality. Comprehensive experiments show that Bounce reliably achieves and often even improves upon state-of-the-art performance on a variety of high-dimensional problems.
comment: 30 pages, 22 figures
♻ ☆ Riemannian Multinomial Logistics Regression for SPD Neural Networks CVPR 2024
Deep neural networks for learning Symmetric Positive Definite (SPD) matrices are gaining increasing attention in machine learning. Despite the significant progress, most existing SPD networks use traditional Euclidean classifiers on an approximated space rather than intrinsic classifiers that accurately capture the geometry of SPD manifolds. Inspired by Hyperbolic Neural Networks (HNNs), we propose Riemannian Multinomial Logistics Regression (RMLR) for the classification layers in SPD networks. We introduce a unified framework for building Riemannian classifiers under the metrics pulled back from the Euclidean space, and showcase our framework under the parameterized Log-Euclidean Metric (LEM) and Log-Cholesky Metric (LCM). Besides, our framework offers a novel intrinsic explanation for the most popular LogEig classifier in existing SPD networks. The effectiveness of our method is demonstrated in three applications: radar recognition, human action recognition, and electroencephalography (EEG) classification. The code is available at https://github.com/GitZH-Chen/SPDMLR.git.
comment: Accepted to CVPR 2024
♻ ☆ Asymptotic generalization error of a single-layer graph convolutional network
While graph convolutional networks show great practical promises, the theoretical understanding of their generalization properties as a function of the number of samples is still in its infancy compared to the more broadly studied case of supervised fully connected neural networks. In this article, we predict the performances of a single-layer graph convolutional network (GCN) trained on data produced by attributed stochastic block models (SBMs) in the high-dimensional limit. Previously, only ridge regression on contextual-SBM (CSBM) has been considered in Shi et al. 2022; we generalize the analysis to arbitrary convex loss and regularization for the CSBM and add the analysis for another data model, the neural-prior SBM. We also study the high signal-to-noise ratio limit, detail the convergence rates of the GCN and show that, while consistent, it does not reach the Bayes-optimal rate for any of the considered cases.
♻ ☆ NetInfoF Framework: Measuring and Exploiting Network Usable Information ICLR 2024
Given a node-attributed graph, and a graph task (link prediction or node classification), can we tell if a graph neural network (GNN) will perform well? More specifically, do the graph structure and the node features carry enough usable information for the task? Our goals are (1) to develop a fast tool to measure how much information is in the graph structure and in the node features, and (2) to exploit the information to solve the task, if there is enough. We propose NetInfoF, a framework including NetInfoF_Probe and NetInfoF_Act, for the measurement and the exploitation of network usable information (NUI), respectively. Given a graph data, NetInfoF_Probe measures NUI without any model training, and NetInfoF_Act solves link prediction and node classification, while two modules share the same backbone. In summary, NetInfoF has following notable advantages: (a) General, handling both link prediction and node classification; (b) Principled, with theoretical guarantee and closed-form solution; (c) Effective, thanks to the proposed adjustment to node similarity; (d) Scalable, scaling linearly with the input size. In our carefully designed synthetic datasets, NetInfoF correctly identifies the ground truth of NUI and is the only method being robust to all graph scenarios. Applied on real-world datasets, NetInfoF wins in 11 out of 12 times on link prediction compared to general GNN baselines.
comment: Accepted to ICLR 2024 (Spotlight)
♻ ☆ ABScribe: Rapid Exploration & Organization of Multiple Writing Variations in Human-AI Co-Writing Tasks using Large Language Models CHI 2024
Exploring alternative ideas by rewriting text is integral to the writing process. State-of-the-art Large Language Models (LLMs) can simplify writing variation generation. However, current interfaces pose challenges for simultaneous consideration of multiple variations: creating new variations without overwriting text can be difficult, and pasting them sequentially can clutter documents, increasing workload and disrupting writers' flow. To tackle this, we present ABScribe, an interface that supports rapid, yet visually structured, exploration and organization of writing variations in human-AI co-writing tasks. With ABScribe, users can swiftly modify variations using LLM prompts, which are auto-converted into reusable buttons. Variations are stored adjacently within text fields for rapid in-place comparisons using mouse-over interactions on a popup toolbar. Our user study with 12 writers shows that ABScribe significantly reduces task workload (d = 1.20, p < 0.001), enhances user perceptions of the revision process (d = 2.41, p < 0.001) compared to a popular baseline workflow, and provides insights into how writers explore variations using LLMs.
comment: CHI 2024
♻ ☆ On the Privacy Effect of Data Enhancement via the Lens of Memorization
Machine learning poses severe privacy concerns as it has been shown that the learned models can reveal sensitive information about their training data. Many works have investigated the effect of widely adopted data augmentation and adversarial training techniques, termed data enhancement in the paper, on the privacy leakage of machine learning models. Such privacy effects are often measured by membership inference attacks (MIAs), which aim to identify whether a particular example belongs to the training set or not. We propose to investigate privacy from a new perspective called memorization. Through the lens of memorization, we find that previously deployed MIAs produce misleading results as they are less likely to identify samples with higher privacy risks as members compared to samples with low privacy risks. To solve this problem, we deploy a recent attack that can capture individual samples' memorization degrees for evaluation. Through extensive experiments, we unveil several findings about the connections between three essential properties of machine learning models, including privacy, generalization gap, and adversarial robustness. We demonstrate that the generalization gap and privacy leakage are less correlated than those of the previous results. Moreover, there is not necessarily a trade-off between adversarial robustness and privacy as stronger adversarial robustness does not make the model more susceptible to privacy attacks.
comment: Accepted by IEEE TIFS, 17 pages
♻ ☆ Are Ensembles Getting Better all the Time?
Ensemble methods combine the predictions of several base models. We study whether or not including more models always improves their average performance. This question depends on the kind of ensemble considered, as well as the predictive metric chosen. We focus on situations where all members of the ensemble are a priori expected to perform as well, which is the case of several popular methods such as random forests or deep ensembles. In this setting, we show that ensembles are getting better all the time if, and only if, the considered loss function is convex. More precisely, in that case, the average loss of the ensemble is a decreasing function of the number of models. When the loss function is nonconvex, we show a series of results that can be summarised as: ensembles of good models keep getting better, and ensembles of bad models keep getting worse. To this end, we prove a new result on the monotonicity of tail probabilities that may be of independent interest. We illustrate our results on a medical prediction problem (diagnosing melanomas using neural nets) and a "wisdom of crowds" experiment (guessing the ratings of upcoming movies).
♻ ☆ Surfer: Progressive Reasoning with World Models for Robotic Manipulation
Considering how to make the model accurately understand and follow natural language instructions and perform actions consistent with world knowledge is a key challenge in robot manipulation. This mainly includes human fuzzy instruction reasoning and the following of physical knowledge. Therefore, the embodied intelligence agent must have the ability to model world knowledge from training data. However, most existing vision and language robot manipulation methods mainly operate in less realistic simulator and language settings and lack explicit modeling of world knowledge. To bridge this gap, we introduce a novel and simple robot manipulation framework, called Surfer. It is based on the world model, treats robot manipulation as a state transfer of the visual scene, and decouples it into two parts: action and scene. Then, the generalization ability of the model on new instructions and new scenes is enhanced by explicit modeling of the action and scene prediction in multi-modal information. In addition to the framework, we also built a robot manipulation simulator that supports full physics execution based on the MuJoCo physics engine. It can automatically generate demonstration training data and test data, effectively reducing labor costs. To conduct a comprehensive and systematic evaluation of the robot manipulation model in terms of language understanding and physical execution, we also created a robotic manipulation benchmark with progressive reasoning tasks, called SeaWave. It contains 4 levels of progressive reasoning tasks and can provide a standardized testing platform for embedded AI agents in multi-modal environments. On average, Surfer achieved a success rate of 54.74% on the defined four levels of manipulation tasks, exceeding the best baseline performance of 47.64%.
♻ ☆ Vulnerability analysis of captcha using Deep learning
Several websites improve their security and avoid dangerous Internet attacks by implementing CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), a type of verification to identify whether the end-user is human or a robot. The most prevalent type of CAPTCHA is text-based, designed to be easily recognized by humans while being unsolvable towards machines or robots. However, as deep learning technology progresses, development of convolutional neural network (CNN) models that predict text-based CAPTCHAs becomes easier. The purpose of this research is to investigate the flaws and vulnerabilities in the CAPTCHA generating systems in order to design more resilient CAPTCHAs. To achieve this, we created CapNet, a Convolutional Neural Network. The proposed platform can evaluate both numerical and alphanumerical CAPTCHAs
♻ ☆ Analyzing and Improving the Training Dynamics of Diffusion Models
Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
♻ ☆ Real-Fake: Effective Training Data Synthesis Through Distribution Matching
Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, the efficiency of synthetic data generated by current methodologies remains inferior when training advanced deep models exclusively, limiting its practical utility. To address this challenge, we analyze the principles underlying training data synthesis for supervised learning and elucidate a principled theoretical framework from the distribution-matching perspective that explicates the mechanisms governing synthesis efficacy. Through extensive experiments, we demonstrate the effectiveness of our synthetic data across diverse image classification tasks, both as a replacement for and augmentation to real datasets, while also benefits such as out-of-distribution generalization, privacy preservation, and scalability. Specifically, we achieve 70.9% top1 classification accuracy on ImageNet1K when training solely with synthetic data equivalent to 1 X the original real data size, which increases to 76.0% when scaling up to 10 X synthetic data.
comment: Code released at (https://github.com/BAAI-DCAI/Training-Data-Synthesis)
♻ ☆ Pseudo-rigid body networks: learning interpretable deformable object dynamics from partial observations
Accurate prediction of deformable linear object (DLO) dynamics is challenging if the task at hand requires a human-interpretable yet computationally fast model. In this work, we draw inspiration from the pseudo-rigid body method (PRB) and model a DLO as a serial chain of rigid bodies whose internal state is unrolled through time by a dynamics network. This dynamics network is trained jointly with a physics-informed encoder which maps observed motion variables to the DLO's hidden state. To encourage that the state acquires a physically meaningful representation, we leverage the forward kinematics of the PRB model as decoder. We demonstrate in robot experiments that the proposed DLO dynamics model provides physically interpretable predictions from partial observations while being on par with black-box models regarding prediction accuracy. The project code is available at: http://tinyurl.com/prb-networks
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Data Augmentation for Time-Series Classification: An Extensive Empirical Study and Comprehensive Survey
Data Augmentation (DA) has emerged as an indispensable strategy in Time Series Classification (TSC), primarily due to its capacity to amplify training samples, thereby bolstering model robustness, diversifying datasets, and curtailing overfitting. However, the current landscape of DA in TSC is plagued with fragmented literature reviews, nebulous methodological taxonomies, inadequate evaluative measures, and a dearth of accessible, user-oriented tools. In light of these challenges, this study embarks on an exhaustive dissection of DA methodologies within the TSC realm. Our initial approach involved an extensive literature review spanning a decade, revealing that contemporary surveys scarcely capture the breadth of advancements in DA for TSC, prompting us to meticulously analyze over 100 scholarly articles to distill more than 60 unique DA techniques. This rigorous analysis precipitated the formulation of a novel taxonomy, purpose-built for the intricacies of DA in TSC, categorizing techniques into five principal echelons: Transformation-Based, Pattern-Based, Generative, Decomposition-Based, and Automated Data Augmentation. Our taxonomy promises to serve as a robust navigational aid for scholars, offering clarity and direction in method selection. Addressing the conspicuous absence of holistic evaluations for prevalent DA techniques, we executed an all-encompassing empirical assessment, wherein upwards of 15 DA strategies were subjected to scrutiny across 8 UCR time-series datasets, employing ResNet and a multi-faceted evaluation paradigm encompassing Accuracy, Method Ranking, and Residual Analysis, yielding a benchmark accuracy of 88.94 +- 11.83%. Our investigation underscored the inconsistent efficacies of DA techniques, with...
♻ ☆ Diffusive Gibbs Sampling
The inadequate mixing of conventional Markov Chain Monte Carlo (MCMC) methods for multi-modal distributions presents a significant challenge in practical applications such as Bayesian inference and molecular dynamics. Addressing this, we propose Diffusive Gibbs Sampling (DiGS), an innovative family of sampling methods designed for effective sampling from distributions characterized by distant and disconnected modes. DiGS integrates recent developments in diffusion models, leveraging Gaussian convolution to create an auxiliary noisy distribution that bridges isolated modes in the original space and applying Gibbs sampling to alternately draw samples from both spaces. Our approach exhibits a better mixing property for sampling multi-modal distributions than state-of-the-art methods such as parallel tempering. We demonstrate that our sampler attains substantially improved results across various tasks, including mixtures of Gaussians, Bayesian neural networks and molecular dynamics.
comment: 15 pages, 11 figures, 4 tables, 1 algorithm. Code available: https://github.com/Wenlin-Chen/DiGS
♻ ☆ From Bricks to Bridges: Product of Invariances to Enhance Latent Space Communication
It has been observed that representations learned by distinct neural networks conceal structural similarities when the models are trained under similar inductive biases. From a geometric perspective, identifying the classes of transformations and the related invariances that connect these representations is fundamental to unlocking applications, such as merging, stitching, and reusing different neural modules. However, estimating task-specific transformations a priori can be challenging and expensive due to several factors (e.g., weights initialization, training hyperparameters, or data modality). To this end, we introduce a versatile method to directly incorporate a set of invariances into the representations, constructing a product space of invariant components on top of the latent representations without requiring prior knowledge about the optimal invariance to infuse. We validate our solution on classification and reconstruction tasks, observing consistent latent similarity and downstream performance improvements in a zero-shot stitching setting. The experimental analysis comprises three modalities (vision, text, and graphs), twelve pretrained foundational models, nine benchmarks, and several architectures trained from scratch.
comment: 41 pages, 14 figures and 31 tables
♻ ☆ Learning to Predict Short-Term Volatility with Order Flow Image Representation
Introduction: The paper addresses the challenging problem of predicting the short-term realized volatility of the Bitcoin price using order flow information. The inherent stochastic nature and anti-persistence of price pose difficulties in accurate prediction. Methods: To address this, we propose a method that transforms order flow data over a fixed time interval (snapshots) into images. The order flow includes trade sizes, trade directions, and limit order book, and is mapped into image colour channels. These images are then used to train both a simple 3-layer Convolutional Neural Network (CNN) and more advanced ResNet-18 and ConvMixer, with additionally supplementing them with hand-crafted features. The models are evaluated against classical GARCH, Multilayer Perceptron trained on raw data, and a naive guess method that considers current volatility as a prediction. Results: The experiments are conducted using price data from January 2021 and evaluate model performance in terms of root mean square error (RMSPE). The results show that our order flow representation with a CNN as a predictive model achieves the best performance, with an RMSPE of 0.85+/-1.1 for the model with aggregated features and 1.0+/-1.4 for the model without feature supplementation. ConvMixer with feature supplementation follows closely. In comparison, the RMSPE for the naive guess method was 1.4+/-3.0.
♻ ☆ DDMI: Domain-Agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations
Recent studies have introduced a new class of generative models for synthesizing implicit neural representations (INRs) that capture arbitrary continuous signals in various domains. These models opened the door for domain-agnostic generative models, but they often fail to achieve high-quality generation. We observed that the existing methods generate the weights of neural networks to parameterize INRs and evaluate the network with fixed positional embeddings (PEs). Arguably, this architecture limits the expressive power of generative models and results in low-quality INR generation. To address this limitation, we propose Domain-agnostic Latent Diffusion Model for INRs (DDMI) that generates adaptive positional embeddings instead of neural networks' weights. Specifically, we develop a Discrete-to-continuous space Variational AutoEncoder (D2C-VAE), which seamlessly connects discrete data and the continuous signal functions in the shared latent space. Additionally, we introduce a novel conditioning mechanism for evaluating INRs with the hierarchically decomposed PEs to further enhance expressive power. Extensive experiments across four modalities, e.g., 2D images, 3D shapes, Neural Radiance Fields, and videos, with seven benchmark datasets, demonstrate the versatility of DDMI and its superior performance compared to the existing INR generative models.
♻ ☆ LNPT: Label-free Network Pruning and Training
Pruning before training enables the deployment of neural networks on smart devices. By retaining weights conducive to generalization, pruned networks can be accommodated on resource-constrained smart devices. It is commonly held that the distance on weight norms between the initialized and the fully-trained networks correlates with generalization performance. However, as we have uncovered, inconsistency between this metric and generalization during training processes, which poses an obstacle to determine the pruned structures on smart devices in advance. In this paper, we introduce the concept of the learning gap, emphasizing its accurate correlation with generalization. Experiments show that the learning gap, in the form of feature maps from the penultimate layer of networks, aligns with variations of generalization performance. We propose a novel learning framework, LNPT, which enables mature networks on the cloud to provide online guidance for network pruning and learning on smart devices with unlabeled data. Our results demonstrate the superiority of this approach over supervised training.
comment: 8 pages,7 figures
♻ ☆ General-Purpose Retrieval-Enhanced Medical Prediction Model Using Near-Infinite History
Developing clinical prediction models (e.g., mortality prediction) based on electronic health records (EHRs) typically relies on expert opinion for feature selection and adjusting observation window size. This burdens experts and creates a bottleneck in the development process. We propose Retrieval-Enhanced Medical prediction model (REMed) to address such challenges. REMed can essentially evaluate an unlimited number of clinical events, select the relevant ones, and make predictions. This approach effectively eliminates the need for manual feature selection and enables an unrestricted observation window. We verified these properties through experiments on 27 clinical tasks and two independent cohorts from publicly available EHR datasets, where REMed outperformed other contemporary architectures that aim to handle as many events as possible. Notably, we found that the preferences of REMed align closely with those of medical experts. We expect our approach to significantly expedite the development of EHR prediction models by minimizing clinicians' need for manual involvement.
comment: The source codes corresponding to this paper are available at: https://github.com/starmpcc/REMed
♻ ☆ Energy-conserving equivariant GNN for elasticity of lattice architected metamaterials
Lattices are architected metamaterials whose properties strongly depend on their geometrical design. The analogy between lattices and graphs enables the use of graph neural networks (GNNs) as a faster surrogate model compared to traditional methods such as finite element modelling. In this work, we generate a big dataset of structure-property relationships for strut-based lattices. The dataset is made available to the community which can fuel the development of methods anchored in physical principles for the fitting of fourth-order tensors. In addition, we present a higher-order GNN model trained on this dataset. The key features of the model are (i) SE(3) equivariance, and (ii) consistency with the thermodynamic law of conservation of energy. We compare the model to non-equivariant models based on a number of error metrics and demonstrate its benefits in terms of predictive performance and reduced training requirements. Finally, we demonstrate an example application of the model to an architected material design task. The methods which we developed are applicable to fourth-order tensors beyond elasticity such as piezo-optical tensor etc.
comment: International Conference on Learning Representations 2024
♻ ☆ REDS: Resource-Efficient Deep Subnetworks for Dynamic Resource Constraints
Deep models deployed on edge devices frequently encounter resource variability, which arises from fluctuating energy levels, timing constraints, or prioritization of other critical tasks within the system. State-of-the-art machine learning pipelines generate resource-agnostic models, not capable to adapt at runtime. In this work we introduce Resource-Efficient Deep Subnetworks (REDS) to tackle model adaptation to variable resources. In contrast to the state-of-the-art, REDS use structured sparsity constructively by exploiting permutation invariance of neurons, which allows for hardware-specific optimizations. Specifically, REDS achieve computational efficiency by (1) skipping sequential computational blocks identified by a novel iterative knapsack optimizer, and (2) leveraging simple math to re-arrange the order of operations in REDS computational graph to take advantage of the data cache. REDS support conventional deep networks frequently deployed on the edge and provide computational benefits even for small and simple networks. We evaluate REDS on seven benchmark architectures trained on the Visual Wake Words, Google Speech Commands, Fashion-MNIST and CIFAR10 datasets, and test on four off-the-shelf mobile and embedded hardware platforms. We provide a theoretical result and empirical evidence for REDS outstanding performance in terms of submodels' test set accuracy, and demonstrate an adaptation time in response to dynamic resource constraints of under 40$\mu$s, utilizing a 2-layer fully-connected network on Arduino Nano 33 BLE.
♻ ☆ Calibration of Deep Learning Classification Models in fNIRS
Functional near-infrared spectroscopy (fNIRS) is a valuable non-invasive tool for monitoring brain activity. The classification of fNIRS data in relation to conscious activity holds significance for advancing our understanding of the brain and facilitating the development of brain-computer interfaces (BCI). Many researchers have turned to deep learning to tackle the classification challenges inherent in fNIRS data due to its strong generalization and robustness. In the application of fNIRS, reliability is really important, and one mathematical formulation of the reliability of confidence is calibration. However, many researchers overlook the important issue of calibration. To address this gap, we propose integrating calibration into fNIRS field and assess the reliability of existing models. Surprisingly, our results indicate poor calibration performance in many proposed models. To advance calibration development in the fNIRS field, we summarize three practical tips. Through this letter, we hope to emphasize the critical role of calibration in fNIRS research and argue for enhancing the reliability of deep learning-based predictions in fNIRS classification tasks. All data from our experimental process are openly available on GitHub.
♻ ☆ Adaptive Message Passing: A General Framework to Mitigate Oversmoothing, Oversquashing, and Underreaching
Long-range interactions are essential for the correct description of complex systems in many scientific fields. The price to pay for including them in the calculations, however, is a dramatic increase in the overall computational costs. Recently, deep graph networks have been employed as efficient, data-driven surrogate models for predicting properties of complex systems represented as graphs. These models rely on a local and iterative message passing strategy that should, in principle, capture long-range information without explicitly modeling the corresponding interactions. In practice, most deep graph networks cannot really model long-range dependencies due to the intrinsic limitations of (synchronous) message passing, namely oversmoothing, oversquashing, and underreaching. This work proposes a general framework that learns to mitigate these limitations: within a variational inference framework, we endow message passing architectures with the ability to freely adapt their depth and filter messages along the way. With theoretical and empirical arguments, we show that this simple strategy better captures long-range interactions, by surpassing the state of the art on five node and graph prediction datasets suited for this problem. Our approach consistently improves the performances of the baselines tested on these tasks. We complement the exposition with qualitative analyses and ablations to get a deeper understanding of the framework's inner workings.
♻ ☆ Simple But Effective: Rethinking the Ability of Deep Learning in fNIRS to Exclude Abnormal Input
Functional near-infrared spectroscopy (fNIRS) is a non-invasive technique for monitoring brain activity. To better understand the brain, researchers often use deep learning to address the classification challenges of fNIRS data. Our study shows that while current networks in fNIRS are highly accurate for predictions within their training distribution, they falter at identifying and excluding abnormal data which is out-of-distribution, affecting their reliability. We propose integrating metric learning and supervised methods into fNIRS research to improve networks capability in identifying and excluding out-of-distribution outliers. This method is simple yet effective. In our experiments, it significantly enhances the performance of various networks in fNIRS, particularly transformer-based one, which shows the great improvement in reliability. We will make our experiment data available on GitHub.
♻ ☆ Immunohistochemistry guided segmentation of benign epithelial cells, in situ lesions, and invasive epithelial cells in breast cancer slides
Digital pathology enables automatic analysis of histopathological sections using artificial intelligence (AI). Automatic evaluation could improve diagnostic efficiency and help find associations between morphological features and clinical outcome. For development of such prediction models, identifying invasive epithelial cells, and separating these from benign epithelial cells and in situ lesions would be the first step. In this study, we aimed to develop an AI model for segmentation of epithelial cells in sections from breast cancer. We generated epithelial ground truth masks by restaining hematoxylin and eosin (HE) sections with cytokeratin (CK) AE1/AE3, and by pathologists' annotations. HE/CK image pairs were used to train a convolutional neural network, and data augmentation was used to make the model more robust. Tissue microarrays (TMAs) from 839 patients, and whole slide images from two patients were used for training and evaluation of the models. The sections were derived from four cohorts of breast cancer patients. TMAs from 21 patients from a fifth cohort was used as a second test set. In quantitative evaluation, a mean Dice score of 0.70, 0.79, and 0.75 for invasive epithelial cells, benign epithelial cells, and in situ lesions, respectively, were achieved. In qualitative scoring (0-5) by pathologists, results were best for all epithelium and invasive epithelium, with scores of 4.7 and 4.4. Scores for benign epithelium and in situ lesions were 3.7 and 2.0. The proposed model segmented epithelial cells in HE stained breast cancer slides well, but further work is needed for accurate division between the classes. Immunohistochemistry, together with pathologists' annotations, enabled the creation of accurate ground truths. The model is made freely available in FastPathology and the code is available at https://github.com/AICAN-Research/breast-epithelium-segmentation
comment: 19 pages, 6 figures. Submitted to a scientific journal
♻ ☆ Lossless Point Cloud Geometry and Attribute Compression Using a Learned Conditional Probability Model
In recent years, we have witnessed the presence of point cloud data in many aspects of our life, from immersive media, autonomous driving to healthcare, although at the cost of a tremendous amount of data. In this paper, we present an efficient lossless point cloud compression method that uses sparse tensor-based deep neural networks to learn point cloud geometry and color probability distributions. Our method represents a point cloud with both occupancy feature and three attribute features at different bit depths in a unified sparse representation. This allows us to efficiently exploit feature-wise and point-wise dependencies within point clouds using a sparse tensor-based neural network and thus build an accurate auto-regressive context model for an arithmetic coder. To the best of our knowledge, this is the first learning-based lossless point cloud geometry and attribute compression approach. Compared with the-state-of-the-art lossless point cloud compression method from Moving Picture Experts Group (MPEG), our method achieves 22.6% reduction in total bitrate on a diverse set of test point clouds while having 49.0% and 18.3% rate reduction on geometry and color attribute component, respectively.
comment: 12 pages, accepted to IEEE Transactions on Circuits and Systems for Video Technology
♻ ☆ In Search of Truth: An Interrogation Approach to Hallucination Detection
Despite the many advances of Large Language Models (LLMs) and their unprecedented rapid evolution, their impact and integration into every facet of our daily lives is limited due to various reasons. One critical factor hindering their widespread adoption is the occurrence of hallucinations, where LLMs invent answers that sound realistic, yet drift away from factual truth. In this paper, we present a novel method for detecting hallucinations in large language models, which tackles a critical issue in the adoption of these models in various real-world scenarios. Through extensive evaluations across multiple datasets and LLMs, including Llama-2, we study the hallucination levels of various recent LLMs and demonstrate the effectiveness of our method to automatically detect them. Notably, we observe up to 62% hallucinations for Llama-2 in a specific experiment, where our method achieves a Balanced Accuracy (B-ACC) of 87%, all without relying on external knowledge.
♻ ☆ Sparsification of the regularized magnetic Laplacian with multi-type spanning forests
In this paper, we consider a ${\rm U}(1)$-connection graph, that is, a graph where each oriented edge is endowed with a unit modulus complex number that is conjugated under orientation flip. A natural replacement for the combinatorial Laplacian is then the magnetic Laplacian, an Hermitian matrix that includes information about the graph's connection. Magnetic Laplacians appear, e.g., in the problem of angular synchronization. In the context of large and dense graphs, we study here sparsifiers of the magnetic Laplacian $\Delta$, i.e., spectral approximations based on subgraphs with few edges. Our approach relies on sampling multi-type spanning forests (MTSFs) using a custom determinantal point process, a probability distribution over edges that favours diversity. In a word, an MTSF is a spanning subgraph whose connected components are either trees or cycle-rooted trees. The latter partially capture the angular inconsistencies of the connection graph, and thus provide a way to compress the information contained in the connection. Interestingly, when the connection graph has weakly inconsistent cycles, samples from the determinantal point process under consideration can be obtained \`a la Wilson, using a random walk with cycle popping. We provide statistical guarantees for a choice of natural estimators of the connection Laplacian, and investigate two practical applications of our sparsifiers: ranking with angular synchronization and graph-based semi-supervised learning. From a statistical perspective, a side result of this paper of independent interest is a matrix Chernoff bound with intrinsic dimension, which allows considering the influence of a regularization -- of the form $\Delta + q \mathbb{I}$ with $q>0$ -- on sparsification guarantees.
comment: 51 pages, 15 figures. Improved presentation of the theoretical results and simulations of larger scale
♻ ☆ Machine learning approach to detect dynamical states from recurrence measures
We integrate machine learning approaches with nonlinear time series analysis, specifically utilizing recurrence measures to classify various dynamical states emerging from time series. We implement three machine learning algorithms Logistic Regression, Random Forest, and Support Vector Machine for this study. The input features are derived from the recurrence quantification of nonlinear time series and characteristic measures of the corresponding recurrence networks. For training and testing we generate synthetic data from standard nonlinear dynamical systems and evaluate the efficiency and performance of the machine learning algorithms in classifying time series into periodic, chaotic, hyper-chaotic, or noisy categories. Additionally, we explore the significance of input features in the classification scheme and find that the features quantifying the density of recurrence points are the most relevant. Furthermore, we illustrate how the trained algorithms can successfully predict the dynamical states of two variable stars, SX Her and AC Her from the data of their light curves.
♻ ☆ MELTing point: Mobile Evaluation of Language Transformers
Transformers have revolutionized the machine learning landscape, gradually making their way into everyday tasks and equipping our computers with ``sparks of intelligence''. However, their runtime requirements have prevented them from being broadly deployed on mobile. As personal devices become increasingly powerful and prompt privacy becomes an ever more pressing issue, we explore the current state of mobile execution of Large Language Models (LLMs). To achieve this, we have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device, supporting different models, devices and frameworks, including Android, iOS and Nvidia Jetson devices. We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance, tracing their memory and energy requirements along the way. Our analysis is the first systematic study of on-device LLM execution, quantifying performance, energy efficiency and accuracy across various state-of-the-art models and showcases the state of on-device intelligence in the era of hyperscale models. Results highlight the performance heterogeneity across targets and corroborates that LLM inference is largely memory-bound. Quantization drastically reduces memory requirements and renders execution viable, but at a non-negligible accuracy cost. Drawing from its energy footprint and thermal behavior, the continuous execution of LLMs remains elusive, as both factors negatively affect user experience. Last, our experience shows that the ecosystem is still in its infancy, and algorithmic as well as hardware breakthroughs can significantly shift the execution cost. We expect NPU acceleration, and framework-hardware co-design to be the biggest bet towards efficient standalone execution, with the alternative of offloading tailored towards edge deployments.
comment: Under review
♻ ☆ MC-DBN: A Deep Belief Network-Based Model for Modality Completion
Recent advancements in multi-modal artificial intelligence (AI) have revolutionized the fields of stock market forecasting and heart rate monitoring. Utilizing diverse data sources can substantially improve prediction accuracy. Nonetheless, additional data may not always align with the original dataset. Interpolation methods are commonly utilized for handling missing values in modal data, though they may exhibit limitations in the context of sparse information. Addressing this challenge, we propose a Modality Completion Deep Belief Network-Based Model (MC-DBN). This approach utilizes implicit features of complete data to compensate for gaps between itself and additional incomplete data. It ensures that the enhanced multi-modal data closely aligns with the dynamic nature of the real world to enhance the effectiveness of the model. We conduct evaluations of the MC-DBN model in two datasets from the stock market forecasting and heart rate monitoring domains. Comprehensive experiments showcase the model's capacity to bridge the semantic divide present in multi-modal data, subsequently enhancing its performance. The source code is available at: https://github.com/logan-0623/DBN-generate
♻ ☆ Learning Adversarial MDPs with Stochastic Hard Constraints
We study online learning problems in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints. We consider two different scenarios. In the first one, we address general CMDPs, where we design an algorithm that attains sublinear regret and cumulative positive constraints violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Indeed, previous works either focus on much weaker soft constraints--allowing for positive violation to cancel out negative ones--or are restricted to stochastic losses. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with state-of-the-art algorithms. This enables their adoption in a much wider range of real-world applications, ranging from autonomous driving to online advertising and recommender systems.
♻ ☆ An Exploratory Study on Automatic Identification of Assumptions in the Development of Deep Learning Frameworks
Stakeholders constantly make assumptions in the development of deep learning (DL) frameworks. These assumptions are related to various types of software artifacts (e.g., requirements, design decisions, and technical debt) and can turn out to be invalid, leading to system failures. Existing approaches and tools for assumption management usually depend on manual identification of assumptions. However, assumptions are scattered in various sources (e.g., code comments, commits, pull requests, and issues) of DL framework development, and manually identifying assumptions has high costs (e.g., time and resources). To overcome the issues of manually identifying assumptions in DL framework development, we constructed a new and largest dataset (i.e., AssuEval) of assumptions collected from the TensorFlow and Keras repositories on GitHub; explored the performance of seven traditional machine learning models (e.g., Support Vector Machine, Classification and Regression Trees), a popular DL model (i.e., ALBERT), and a large language model (i.e., ChatGPT) of identifying assumptions on the AssuEval dataset. The experiment results show that: ALBERT achieves the best performance (f1-score: 0.9584) of identifying assumptions on the AssuEval dataset, which is much better than the other models (the 2nd best f1-score is 0.6211, achieved by ChatGPT). Though ChatGPT is the most popular large language model, we do not recommend using it to identify assumptions in DL framework development because of its low performance on the task. Fine-tuning ChatGPT specifically for assumption identification could improve the performance. This study provides researchers with the largest dataset of assumptions for further research (e.g., assumption classification, evaluation, and reasoning) and helps practitioners better understand assumptions and how to manage them in their projects.
comment: 28 pages, 15 images, 10 tables, Manuscript submitted to a Journal (2024)
♻ ☆ Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel's Spectrum
Wide neural networks are biased towards learning certain functions, influencing both the rate of convergence of gradient descent (GD) and the functions that are reachable with GD in finite training time. As such, there is a great need for methods that can modify this bias according to the task at hand. To that end, we introduce Modified Spectrum Kernels (MSKs), a novel family of constructed kernels that can be used to approximate kernels with desired eigenvalues for which no closed form is known. We leverage the duality between wide neural networks and Neural Tangent Kernels and propose a preconditioned gradient descent method, which alters the trajectory of GD. As a result, this allows for a polynomial and, in some cases, exponential training speedup without changing the final solution. Our method is both computationally efficient and simple to implement.
♻ ☆ CharPoet: A Chinese Classical Poetry Generation System Based on Token-free LLM
Automatic Chinese classical poetry generation has attracted much research interest, but achieving effective control over format and content simultaneously remains challenging. Traditional systems usually accept keywords as user inputs, resulting in limited control over content. Large language models (LLMs) improve content control by allowing unrestricted user instructions, but the token-by-token generation process frequently makes format errors. Motivated by this, we propose CharPoet, a Chinese classical poetry generation system based on token-free LLM, which provides effective control over both format and content. Our token-free architecture generates in a character-by-character manner, enabling precise control over the number of characters. Pruned from existing token-based LLMs, CharPoet inherits their pretrained capabilities and can generate poetry following instructions like "Write me a poem for my mother's birthday." CharPoet achieves format accuracy above 0.96, outperforming Jiuge-GPT-2 (0.91) and GPT-4 (0.38). In terms of content quality, CharPoet surpasses traditional systems including Jiuge, and is comparable to other LLMs. Our system is open source and available at https://modelscope.cn/models/CharPoet/CharPoet. A video demonstration of CharPoet is available at https://youtu.be/voZ25qEp3Dc.
♻ ☆ Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training
While large language models (LLMs) have achieved impressive performance across diverse tasks, recent studies showcase that causal LLMs suffer from the "reversal curse". It is a typical example that the model knows "A's father is B", but is unable to reason "B's child is A". This limitation poses a challenge to the advancement of artificial general intelligence (AGI), as it suggests a gap in the models' ability to comprehend and apply bidirectional reasoning. In this paper, we first conduct substantial evaluation and identify that the root cause of the reversal curse lies in the different word order between the training and inference stage, namely, the poor ability of causal language models to predict antecedent words within the training data. Accordingly, permutation on the training data is considered as a potential solution, since this can make the model predict antecedent words or tokens. However, previous permutation methods may disrupt complete phrases or entities, thereby posing challenges for the model to comprehend and learn from training data. To address this issue, we propose Semantic-aware Permutation Training (SPT), which addresses this issue by segmenting the training sentences into semantic units (i.e., entities or phrases) with an assistant language model and permuting these units before feeding into the model. Extensive experiments demonstrate that SPT effectively mitigates the reversal curse since the performance on reversed questions approximates that on the forward ones, and significantly advances the performance of existing works.
♻ ☆ A Pre-trained Data Deduplication Model based on Active Learning
In the era of big data, the issue of data quality has become increasingly prominent. One of the main challenges is the problem of duplicate data, which can arise from repeated entry or the merging of multiple data sources. These "dirty data" problems can significantly limit the effective application of big data. To address the issue of data deduplication, we propose a pre-trained deduplication model based on active learning, which is the first work that utilizes active learning to address the problem of deduplication at the semantic level. The model is built on a pre-trained Transformer and fine-tuned to solve the deduplication problem as a sequence to classification task, which firstly integrate the transformer with active learning into an end-to-end architecture to select the most valuable data for deduplication model training, and also firstly employ the R-Drop method to perform data augmentation on each round of labeled data, which can reduce the cost of manual labeling and improve the model's performance. Experimental results demonstrate that our proposed model outperforms previous state-of-the-art (SOTA) for deduplicated data identification, achieving up to a 28% improvement in Recall score on benchmark datasets.
♻ ☆ Learning from Label Proportions: Bootstrapping Supervised Learners via Belief Propagation ICLR 2024
Learning from Label Proportions (LLP) is a learning problem where only aggregate level labels are available for groups of instances, called bags, during training, and the aim is to get the best performance at the instance-level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines (up to 15%) for the LLP Binary Classification problem on various dataset types - tabular and Image. We achieve these improvements with minimal computational overhead above standard supervised learning due to Belief Propagation, for large bag sizes, even for a million samples.
comment: Published as a conference paper at The Twelfth International Conference on Learning Representations (ICLR 2024) & Oral Presentation at Regulatable ML @ NeurIPS 2023
♻ ☆ AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models
Existing customization methods require access to multiple reference examples to align pre-trained diffusion probabilistic models (DPMs) with user-provided concepts. This paper aims to address the challenge of DPM customization when the only available supervision is a differentiable metric defined on the generated contents. Since the sampling procedure of DPMs involves recursive calls to the denoising UNet, na\"ive gradient backpropagation requires storing the intermediate states of all iterations, resulting in extremely high memory consumption. To overcome this issue, we propose a novel method AdjointDPM, which first generates new samples from diffusion models by solving the corresponding probability-flow ODEs. It then uses the adjoint sensitivity method to backpropagate the gradients of the loss to the models' parameters (including conditioning signals, network weights, and initial noises) by solving another augmented ODE. To reduce numerical errors in both the forward generation and gradient backpropagation processes, we further reparameterize the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. Finally, we demonstrate the effectiveness of AdjointDPM on three interesting tasks: converting visual effects into identification text embeddings, finetuning DPMs for specific types of stylization, and optimizing initial noise to generate adversarial samples for security auditing.
♻ ☆ Lightweight, error-tolerant edge detection using memristor-enabled stochastic logics
The demand for efficient edge vision has spurred the interest in developing stochastic computing approaches for performing image processing tasks. Memristors with inherent stochasticity readily introduce probability into the computations and thus enable stochastic image processing computations. Here, we present a stochastic computing approach for edge detection, a fundamental image processing technique, facilitated with memristor-enabled stochastic logics. Specifically, we integrate the memristors with logic circuits and harness the stochasticity from the memristors to realize compact stochastic logics for stochastic number encoding and processing. The stochastic numbers, exhibiting well-regulated probabilities and correlations, can be processed to perform logic operations with statistical probabilities. This can facilitate lightweight stochastic edge detection for edge visual scenarios characterized with high-level noise errors. As a practical demonstration, we implement a hardware stochastic Roberts cross operator using the stochastic logics, and prove its exceptional edge detection performance, remarkably, with 95% less computational cost while withstanding 50% bit-flip errors. The results underscore the great potential of our stochastic edge detection approach in developing lightweight, error-tolerant edge vision hardware and systems for autonomous driving, virtual/augmented reality, medical imaging diagnosis, industrial automation, and beyond.
♻ ☆ SAM-OCTA: Prompting Segment-Anything for OCTA Image Segmentation
Segmenting specific targets or biomarkers is necessary to analyze optical coherence tomography angiography (OCTA) images. Previous methods typically segment all the targets in an OCTA sample, such as retinal vessels (RVs). Although these methods perform well in accuracy and precision, OCTA analyses often focusing local information within the images which has not been fulfilled. In this paper, we propose a method called SAM-OCTA for local segmentation in OCTA images. The method fine-tunes a pre-trained segment anything model (SAM) using low-rank adaptation (LoRA) and utilizes prompt points for local RVs, arteries, and veins segmentation in OCTA. To explore the effect and mechanism of prompt points, we set up global and local segmentation modes with two prompt point generation strategies, namely random selection and special annotation. Considering practical usage, we conducted extended experiments with different model scales and analyzed the model performance before and after fine-tuning besides the general segmentation task. From comprehensive experimental results with the OCTA-500 dataset, our SAM-OCTA method has achieved state-of-the-art performance in common OCTA segmentation tasks related to RV and FAZ, and it also performs accurate segmentation of artery-vein and local vessels. The code is available at https://github.com/ShellRedia/SAM-OCTA-extend.
comment: arXiv admin note: text overlap with arXiv:2309.11758
♻ ☆ AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce \textbf{AgentOhana} as a comprehensive solution to address these challenges. \textit{AgentOhana} aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. It meticulously standardizes and unifies these trajectories into a consistent format, streamlining the creation of a generic data loader optimized for agent training. Leveraging the data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training. Additionally, we present \textbf{xLAM-v0.1}, a large action model tailored for AI agents, which demonstrates exceptional performance across various benchmarks. Begin the exploration at \url{https://github.com/SalesforceAIResearch/xLAM}.
comment: Add GitHub repo link at \url{https://github.com/SalesforceAIResearch/xLAM} and HuggingFace model link at \url{https://huggingface.co/Salesforce/xLAM-v0.1-r}
♻ ☆ Vehicle Dispatching and Routing of On-Demand Intercity Ride-Pooling Services: A Multi-Agent Hierarchical Reinforcement Learning Approach
The integrated development of city clusters has given rise to an increasing demand for intercity travel. Intercity ride-pooling service exhibits considerable potential in upgrading traditional intercity bus services by implementing demand-responsive enhancements. Nevertheless, its online operations suffer the inherent complexities due to the coupling of vehicle resource allocation among cities and pooled-ride vehicle routing. To tackle these challenges, this study proposes a two-level framework designed to facilitate online fleet management. Specifically, a novel multi-agent feudal reinforcement learning model is proposed at the upper level of the framework to cooperatively assign idle vehicles to different intercity lines, while the lower level updates the routes of vehicles using an adaptive large neighborhood search heuristic. Numerical studies based on the realistic dataset of Xiamen and its surrounding cities in China show that the proposed framework effectively mitigates the supply and demand imbalances, and achieves significant improvement in both the average daily system profit and order fulfillment ratio.
♻ ☆ Exploring the Privacy-Energy Consumption Tradeoff for Split Federated Learning
Split Federated Learning (SFL) has recently emerged as a promising distributed learning technology, leveraging the strengths of both federated and split learning. It emphasizes the advantages of rapid convergence while addressing privacy concerns. As a result, this innovation has received significant attention from both industry and academia. However, since the model is split at a specific layer, known as a cut layer, into both client-side and server-side models for the SFL, the choice of the cut layer in SFL can have a substantial impact on the energy consumption of clients and their privacy, as it influences the training burden and the output of the client-side models. In this article, we provide a comprehensive overview of the SFL process and thoroughly analyze energy consumption and privacy. This analysis considers the influence of various system parameters on the cut layer selection strategy. Additionally, we provide an illustrative example of the cut layer selection, aiming to minimize clients' risk of reconstructing the raw data at the server while sustaining energy consumption within the required energy budget, which involves trade-offs. Finally, we address open challenges in this field. These directions represent promising avenues for future research and development.
comment: 10 pages, 5 figures
♻ ☆ Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions
A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
♻ ☆ Machine-Made Media: Monitoring the Mobilization of Machine-Generated Articles on Misinformation and Mainstream News Websites
As large language models (LLMs) like ChatGPT have gained traction, an increasing number of news websites have begun utilizing them to generate articles. However, not only can these language models produce factually inaccurate articles on reputable websites but disreputable news sites can utilize LLMs to mass produce misinformation. To begin to understand this phenomenon, we present one of the first large-scale studies of the prevalence of synthetic articles within online news media. To do this, we train a DeBERTa-based synthetic news detector and classify over 15.46 million articles from 3,074 misinformation and mainstream news websites. We find that between January 1, 2022, and May 1, 2023, the relative number of synthetic news articles increased by 57.3% on mainstream websites while increasing by 474% on misinformation sites. We find that this increase is largely driven by smaller less popular websites. Analyzing the impact of the release of ChatGPT using an interrupted-time-series, we show that while its release resulted in a marked increase in synthetic articles on small sites as well as misinformation news websites, there was not a corresponding increase on large mainstream news websites.
comment: Accepted to ICWSM 2024
♻ ☆ Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
From content moderation to wildlife conservation, the number of applications that require models to recognize nuanced or subjective visual concepts is growing. Traditionally, developing classifiers for such concepts requires substantial manual effort measured in hours, days, or even months to identify and annotate data needed for training. Even with recently proposed Agile Modeling techniques, which enable rapid bootstrapping of image classifiers, users are still required to spend 30 minutes or more of monotonous, repetitive data labeling just to train a single classifier. Drawing on Fiske's Cognitive Miser theory, we propose a new framework that alleviates manual effort by replacing human labeling with natural language interactions, reducing the total effort required to define a concept by an order of magnitude: from labeling 2,000 images to only 100 plus some natural language interactions. Our framework leverages recent advances in foundation models, both large language models and vision-language models, to carve out the concept space through conversation and by automatically labeling training data points. Most importantly, our framework eliminates the need for crowd-sourced annotations. Moreover, our framework ultimately produces lightweight classification models that are deployable in cost-sensitive scenarios. Across 15 subjective concepts and across 2 public image classification datasets, our trained models outperform traditional Agile Modeling as well as state-of-the-art zero-shot classification models like ALIGN, CLIP, CuPL, and large visual question-answering models like PaLI-X.
♻ ☆ Prediction Error Estimation in Random Forests
In this paper, error estimates of classification Random Forests are quantitatively assessed. Based on the initial theoretical framework built by Bates et al. (2023), the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests' estimates of prediction error is closer on average to the true error rate instead of the average prediction error. This is opposite the findings of Bates et al. (2023) which are given for logistic regression. We further show that our result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.
comment: arXiv admin note: text overlap with arXiv:2104.00673 by other authors
♻ ☆ Unraveling Privacy Risks of Individual Fairness in Graph Neural Networks ICDE
Graph neural networks (GNNs) have gained significant attraction due to their expansive real-world applications. To build trustworthy GNNs, two aspects - fairness and privacy - have emerged as critical considerations. Previous studies have separately examined the fairness and privacy aspects of GNNs, revealing their trade-off with GNN performance. Yet, the interplay between these two aspects remains unexplored. In this paper, we pioneer the exploration of the interaction between the privacy risks of edge leakage and the individual fairness of a GNN. Our theoretical analysis unravels that edge privacy risks unfortunately escalate when the nodes' individual fairness improves. Such an issue hinders the accomplishment of privacy and fairness of GNNs at the same time. To balance fairness and privacy, we carefully introduce fairness-aware loss reweighting based on influence function and privacy-aware graph structure perturbation modules within a fine-tuning mechanism. Experimental results underscore the effectiveness of our approach in achieving GNN fairness with limited performance compromise and controlled privacy risks. This work contributes to the comprehensively developing trustworthy GNNs by simultaneously addressing both fairness and privacy aspects.
comment: Accepted by IEEE International Conference on Data Engineering (ICDE) 2024
♻ ☆ Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions EACL 2023
In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator's instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.
comment: EACL 2023 (Outstanding Paper Award)
♻ ☆ Policy Bifurcation in Safe Reinforcement Learning
Safe reinforcement learning (RL) offers advanced solutions to constrained optimal control problems. Existing studies in safe RL implicitly assume continuity in policy functions, where policies map states to actions in a smooth, uninterrupted manner; however, our research finds that in some scenarios, the feasible policy should be discontinuous or multi-valued, interpolating between discontinuous local optima can inevitably lead to constraint violations. We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of policy bifurcation in safe RL, which corresponds to the contractibility of the reachable tuple. Our theorem reveals that in scenarios where the obstacle-free state space is non-simply connected, a feasible policy is required to be bifurcated, meaning its output action needs to change abruptly in response to the varying state. To train such a bifurcated policy, we propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output. The bifurcated behavior can be achieved by selecting the Gaussian component with the highest mixing coefficient. Besides, MUPO also integrates spectral normalization and forward KL divergence to enhance the policy's capability of exploring different modes. Experiments with vehicle control tasks show that our algorithm successfully learns the bifurcated policy and ensures satisfying safety, while a continuous policy suffers from inevitable constraint violations.
♻ ☆ An Aligning and Training Framework for Multimodal Recommendations
With the development of multimedia applications, multimodal recommendations are playing an essential role, as they can leverage rich contexts beyond user interactions. Existing methods mainly regard multimodal information as an auxiliary, using them to help learn ID features; however, there exist semantic gaps among multimodal content features and ID features, for which directly using multimodal information as an auxiliary would lead to misalignment in representations of users and items. In this paper, we first systematically investigate the misalignment issue in multimodal recommendations, and propose a solution named AlignRec. In AlignRec, the recommendation objective is decomposed into three alignments, namely alignment within contents, alignment between content and categorical ID, and alignment between users and items. Each alignment is characterized by a specific objective function and is integrated into our multimodal recommendation framework. To effectively train our AlignRec, we propose starting from pre-training the first alignment to obtain unified multimodal features and subsequently training the following two alignments together with these features as input. As it is essential to analyze whether each multimodal feature helps in training, we design three new classes of metrics to evaluate intermediate performance. Our extensive experiments on three real-world datasets consistently verify the superiority of AlignRec compared to nine baselines. We also find that the multimodal features generated by AlignRec are better than currently used ones, which are to be open-sourced.
comment: 11 pages, add some necessary explanations, revise typos
♻ ☆ Comprehensive Evaluation of GNN Training Systems: A Data Management Perspective VLDB 2024
Many Graph Neural Network (GNN) training systems have emerged recently to support efficient GNN training. Since GNNs embody complex data dependencies between training samples, the training of GNNs should address distinct challenges different from DNN training in data management, such as data partitioning, batch preparation for mini-batch training, and data transferring between CPUs and GPUs. These factors, which take up a large proportion of training time, make data management in GNN training more significant. This paper reviews GNN training from a data management perspective and provides a comprehensive analysis and evaluation of the representative approaches. We conduct extensive experiments on various benchmark datasets and show many interesting and valuable results. We also provide some practical tips learned from these experiments, which are helpful for designing GNN training systems in the future.
comment: 12 pages, 18 figures. (Accepted by VLDB 2024)
♻ ☆ PAGE: Prototype-Based Model-Level Explanations for Graph Neural Networks AAAI-22
Aside from graph neural networks (GNNs) attracting significant attention as a powerful framework revolutionizing graph representation learning, there has been an increasing demand for explaining GNN models. Although various explanation methods for GNNs have been developed, most studies have focused on instance-level explanations, which produce explanations tailored to a given graph instance. In our study, we propose Prototype-bAsed GNN-Explainer (PAGE), a novel model-level GNN explanation method that explains what the underlying GNN model has learned for graph classification by discovering human-interpretable prototype graphs. Our method produces explanations for a given class, thus being capable of offering more concise and comprehensive explanations than those of instance-level explanations. First, PAGE selects embeddings of class-discriminative input graphs on the graph-level embedding space after clustering them. Then, PAGE discovers a common subgraph pattern by iteratively searching for high matching node tuples using node-level embeddings via a prototype scoring function, thereby yielding a prototype graph as our explanation. Using six graph classification datasets, we demonstrate that PAGE qualitatively and quantitatively outperforms the state-of-the-art model-level explanation method. We also carry out systematic experimental studies by demonstrating the relationship between PAGE and instance-level explanation methods, the robustness of PAGE to input data scarce environments, and the computational efficiency of the proposed prototype scoring function in PAGE.
comment: 18 pages, 12 figures, 5 tables; to appear in the IEEE Transactions on Pattern Analysis and Machine Intelligence (Please cite our journal version that will appear in an upcoming issue. Its two-page extended summary was presented in the AAAI-22 Student Abstract and Poster Program.)
♻ ☆ BOBA: Byzantine-Robust Federated Learning with Label Skewness AISTATS 2024
In federated learning, most existing robust aggregation rules (AGRs) combat Byzantine attacks in the IID setting, where client data is assumed to be independent and identically distributed. In this paper, we address label skewness, a more realistic and challenging non-IID setting, where each client only has access to a few classes of data. In this setting, state-of-the-art AGRs suffer from selection bias, leading to significant performance drop for particular classes; they are also more vulnerable to Byzantine attacks due to the increased variation among gradients of honest clients. To address these limitations, we propose an efficient two-stage method named BOBA. Theoretically, we prove the convergence of BOBA with an error of the optimal order. Our empirical evaluations demonstrate BOBA's superior unbiasedness and robustness across diverse models and datasets when compared to various baselines. Our code is available at https://github.com/baowenxuan/BOBA .
comment: Accepted by AISTATS 2024
♻ ☆ Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity AISTATS 2024
Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize the best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by \method\ achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy across 11 downstream datasets, of the next best baseline. The code is available at: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip.
comment: AISTATS 2024, Code: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip
♻ ☆ Machine-learning optimized measurements of chaotic dynamical systems via the information bottleneck
Deterministic chaos permits a precise notion of a "perfect measurement" as one that, when obtained repeatedly, captures all of the information created by the system's evolution with minimal redundancy. Finding an optimal measurement is challenging, and has generally required intimate knowledge of the dynamics in the few cases where it has been done. We establish an equivalence between a perfect measurement and a variant of the information bottleneck. As a consequence, we can employ machine learning to optimize measurement processes that efficiently extract information from trajectory data. We obtain approximately optimal measurements for multiple chaotic maps and lay the necessary groundwork for efficient information extraction from general time series.
comment: Project page: https://distributed-information-bottleneck.github.io
♻ ☆ RL in Markov Games with Independent Function Approximation: Improved Sample Complexity Bound under the Local Access Model AISTATS 2024
Efficiently learning equilibria with large state and action spaces in general-sum Markov games while overcoming the curse of multi-agency is a challenging problem. Recent works have attempted to solve this problem by employing independent linear function classes to approximate the marginal $Q$-value for each agent. However, existing sample complexity bounds under such a framework have a suboptimal dependency on the desired accuracy $\varepsilon$ or the action space. In this work, we introduce a new algorithm, Lin-Confident-FTRL, for learning coarse correlated equilibria (CCE) with local access to the simulator, i.e., one can interact with the underlying environment on the visited states. Up to a logarithmic dependence on the size of the state space, Lin-Confident-FTRL learns $\epsilon$-CCE with a provable optimal accuracy bound $O(\epsilon^{-2})$ and gets rids of the linear dependency on the action space, while scaling polynomially with relevant problem parameters (such as the number of agents and time horizon). Moreover, our analysis of Linear-Confident-FTRL generalizes the virtual policy iteration technique in the single-agent local planning literature, which yields a new computationally efficient algorithm with a tighter sample complexity bound when assuming random access to the simulator.
comment: Accepted at the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024)
♻ ☆ BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets
Reinforcement learning (RL) makes an agent learn from trial-and-error experiences gathered during the interaction with the environment. Recently, offline RL has become a popular RL paradigm because it saves the interactions with environments. In offline RL, data providers share large pre-collected datasets, and others can train high-quality agents without interacting with the environments. This paradigm has demonstrated effectiveness in critical tasks like robot control, autonomous driving, etc. However, less attention is paid to investigating the security threats to the offline RL system. This paper focuses on backdoor attacks, where some perturbations are added to the data (observations) such that given normal observations, the agent takes high-rewards actions, and low-reward actions on observations injected with triggers. In this paper, we propose Baffle (Backdoor Attack for Offline Reinforcement Learning), an approach that automatically implants backdoors to RL agents by poisoning the offline RL dataset, and evaluate how different offline RL algorithms react to this attack. Our experiments conducted on four tasks and four offline RL algorithms expose a disquieting fact: none of the existing offline RL algorithms is immune to such a backdoor attack. More specifically, Baffle modifies 10\% of the datasets for four tasks (3 robotic controls and 1 autonomous driving). Agents trained on the poisoned datasets perform well in normal settings. However, when triggers are presented, the agents' performance decreases drastically by 63.2\%, 53.9\%, 64.7\%, and 47.4\% in the four tasks on average. The backdoor still persists after fine-tuning poisoned agents on clean datasets. We further show that the inserted backdoor is also hard to be detected by a popular defensive method. This paper calls attention to developing more effective protection for the open-source offline RL dataset.
comment: Accepted at IEEE S&P (Oakland) 2024
♻ ☆ A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization
Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main facts: a focus on tail-end performance that overlooks many sampled trajectories, and the potential of gradient vanishing when the lower tail of the return distribution is overly flat. To address these challenges, we propose a simple mixture policy parameterization. This method integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy. By employing this strategy, all collected trajectories can be utilized for policy updating, and the issue of vanishing gradients is counteracted by stimulating higher returns through the risk-neutral component, thus lifting the tail and preventing flatness. Our empirical study reveals that this mixture parameterization is uniquely effective across a variety of benchmark domains. Specifically, it excels in identifying risk-averse CVaR policies in some Mujoco environments where the traditional CVaR-PG fails to learn a reasonable policy.
♻ ☆ Malaria Parasitic Detection using a New Deep Boosted and Ensemble Learning Framework
Malaria is a potentially fatal plasmodium parasite injected by female anopheles mosquitoes that infect red blood cells and millions worldwide yearly. However, specialists' manual screening in clinical practice is laborious and prone to error. Therefore, a novel Deep Boosted and Ensemble Learning (DBEL) framework, comprising the stacking of new Boosted-BR-STM convolutional neural networks (CNN) and the ensemble ML classifiers, is developed to screen malaria parasite images. The proposed Boosted-BR-STM is based on a new dilated-convolutional block-based split transform merge (STM) and feature-map Squeezing-Boosting (SB) ideas. Moreover, the new STM block uses regional and boundary operations to learn the malaria parasite's homogeneity, heterogeneity, and boundary with patterns. Furthermore, the diverse boosted channels are attained by employing Transfer Learning-based new feature-map SB in STM blocks at the abstract, medium, and conclusion levels to learn minute intensity and texture variation of the parasitic pattern. The proposed DBEL framework implicates the stacking of prominent and diverse boosted channels and provides the generated discriminative features of the developed Boosted-BR-STM to the ensemble of ML classifiers. The proposed framework improves the discrimination ability and generalization of ensemble learning. Moreover, the deep feature spaces of the developed Boosted-BR-STM and customized CNNs are fed into ML classifiers for comparative analysis. The proposed DBEL framework outperforms the existing techniques on the NIH malaria dataset that are enhanced using discrete wavelet transform to enrich feature space. The proposed DBEL framework achieved Accuracy (98.50%), Sensitivity (0.9920), F-score (0.9850), and AUC (0.997), which suggest it to be utilized for malaria parasite screening.
comment: 26 pages, 10 figures, 9 Tables
♻ ☆ Precipitation Downscaling with Spatiotemporal Video Diffusion
In climate science and meteorology, high-resolution local precipitation (rain and snowfall) predictions are limited by the computational costs of simulation-based methods. Statistical downscaling, or super-resolution, is a common workaround where a low-resolution prediction is improved using statistical approaches. Unlike traditional computer vision tasks, weather and climate applications require capturing the accurate conditional distribution of high-resolution given low-resolution patterns to assure reliable ensemble averages and unbiased estimates of extreme events, such as heavy rain. This work extends recent video diffusion models to precipitation super-resolution, employing a deterministic downscaler followed by a temporally-conditioned diffusion model to capture noise characteristics and high-frequency patterns. We test our approach on FV3GFS output, an established large-scale global atmosphere model, and compare it against five state-of-the-art baselines. Our analysis, capturing CRPS, MSE, precipitation distributions, and qualitative aspects using California and the Himalayas as examples, establishes our method as a new standard for data-driven precipitation downscaling.
Cryptography and Security
☆ Quantum-Secure Certificate-Less Conditional Privacy-Preserving Authentication for VANET SC
Vehicular Ad-hoc Networks (VANETs) marked a pronounced change in the Intelligent Transport System and Smart Cities through seamless vehicle communication to intensify safety and efficacy. However, a few authentication schemes have been devised in the literature to ensure the authenticity of the source and information in the post-quantum era. The most popular base for such construction is lattice-based cryptography. However, existing lattice-based authentication schemes fall short of addressing the potential challenges of the leakage of the master secret key and key-escrow problem. By ingeniously addressing both issues, the paper proposes the \emph{first} quantum secure authentication scheme to eliminate the flaws while maintaining the system's overall efficiency intact. Compared to the state-of-the-art schemes, the provable security and overall performance assessment highlight the suitability of the proposed approach.
comment: Paper submitted to IEEE TDSC under review
☆ Statistical Confidence in Mining Power Estimates for PoW Blockchains
The security of blockchain systems depends on the distribution of mining power across participants. If sufficient mining power is controlled by one entity, they can force their own version of events. This may allow them to double spend coins, for example. For Proof of Work (PoW) blockchains, however, the distribution of mining power cannot be read directly from the blockchain and must instead be inferred from the number of blocks mined in a specific sample window. We introduce a framework to quantify this statistical uncertainty for the Nakamoto coefficient, which is a commonly-used measure of blockchain decentralization. We show that aggregating blocks over a day can lead to considerable uncertainty, with Bitcoin failing more than half the hypothesis tests ({\alpha} = 0.05) when using a daily granularity. For these reasons, we recommend that blocks are aggregated over a sample window of at least 7 days. Instead of reporting a single value, our approach produces a range of possible Nakamoto coefficient values that have statistical support at a particular significance level {\alpha}.
☆ Threats, Attacks, and Defenses in Machine Unlearning: A Survey
Recently, Machine Unlearning (MU) has gained considerable attention for its potential to improve AI safety by removing the influence of specific data from trained Machine Learning (ML) models. This process, known as knowledge removal, addresses concerns about data such as sensitivity, copyright restrictions, obsolescence, or low quality. This capability is also crucial for ensuring compliance with privacy regulations such as the Right To Be Forgotten (RTBF). Therefore, strategic knowledge removal mitigates the risk of harmful outcomes, safeguarding against biases, misinformation, and unauthorized data exploitation, thereby enhancing the ethical use and reliability of AI systems. Efforts have been made to design efficient unlearning approaches, with MU services being examined for integration with existing machine learning as a service (MLaaS), allowing users to submit requests to erase data. However, recent research highlights vulnerabilities in machine unlearning systems, such as information leakage and malicious unlearning requests, that can lead to significant security and privacy concerns. Moreover, extensive research indicates that unlearning methods and prevalent attacks fulfill diverse roles within MU systems. For instance, unlearning can act as a mechanism to recover models from backdoor attacks, while backdoor attacks themselves can serve as an evaluation metric for unlearning effectiveness. This underscores the intricate relationship and complex interplay between these elements in maintaining system functionality and safety. Therefore, this survey seeks to bridge the gap between the extensive number of studies on threats, attacks, and defenses in machine unlearning and the absence of a comprehensive review that categorizes their taxonomy, methods, and solutions, thus offering valuable insights for future research directions and practical implementations.
☆ DL2Fence: Integrating Deep Learning and Frame Fusion for Enhanced Detection and Localization of Refined Denial-of-Service in Large-Scale NoCs
This study introduces a refined Flooding Injection Rate-adjustable Denial-of-Service (DoS) model for Network-on-Chips (NoCs) and more importantly presents DL2Fence, a novel framework utilizing Deep Learning (DL) and Frame Fusion (2F) for DoS detection and localization. Two Convolutional Neural Networks models for classification and segmentation were developed to detect and localize DoS respectively. It achieves detection and localization accuracies of 95.8\% and 91.7\%, and precision rates of 98.5\% and 99.3\% in a 16x16 mesh NoC. The framework's hardware overhead notably decreases by 76.3\% when scaling from 8x8 to 16x16 NoCs, and it requires 42.4\% less hardware compared to state-of-the-arts. This advancement demonstrates DL2Fence's effectiveness in balancing outstanding detection performance in large-scale NoCs with extremely low hardware overhead.
☆ Have You Poisoned My Data? Defending Neural Networks against Data Poisoning ESORICS
The unprecedented availability of training data fueled the rapid development of powerful neural networks in recent years. However, the need for such large amounts of data leads to potential threats such as poisoning attacks: adversarial manipulations of the training data aimed at compromising the learned model to achieve a given adversarial goal. This paper investigates defenses against clean-label poisoning attacks and proposes a novel approach to detect and filter poisoned datapoints in the transfer learning setting. We define a new characteristic vector representation of datapoints and show that it effectively captures the intrinsic properties of the data distribution. Through experimental analysis, we demonstrate that effective poisons can be successfully differentiated from clean points in the characteristic vector space. We thoroughly evaluate our proposed approach and compare it to existing state-of-the-art defenses using multiple architectures, datasets, and poison budgets. Our evaluation shows that our proposal outperforms existing approaches in defense rate and final trained model performance across all experimental settings.
comment: Paper accepted for publication at European Symposium on Research in Computer Security (ESORICS) 2024
☆ The Mediterraneus Protocol: building an SSI native decentralised ecosystem of digital services
This paper presents, for the first time, the Mediterraneous protocol. It is designed to support the development of an Internet of digital services, owned by their creators, and consumed by users by presenting their decentralised digital identity and a proof of service purchase. Mediterraneous is Self-Sovereign Identity (SSI) native, integrating the SSI model at the core of its working principles to overcome the limitations resulting from using pseudonyms and centralised access control of existing Web3 solutions.
☆ Adversarial Attacks and Defenses in Automated Control Systems: A Comprehensive Benchmark
Integrating machine learning into Automated Control Systems (ACS) enhances decision-making in industrial process management. One of the limitations to the widespread adoption of these technologies in industry is the vulnerability of neural networks to adversarial attacks. This study explores the threats in deploying deep learning models for fault diagnosis in ACS using the Tennessee Eastman Process dataset. By evaluating three neural networks with different architectures, we subject them to six types of adversarial attacks and explore five different defense methods. Our results highlight the strong vulnerability of models to adversarial samples and the varying effectiveness of defense strategies. We also propose a novel protection approach by combining multiple defense methods and demonstrate it's efficacy. This research contributes several insights into securing machine learning within ACS, ensuring robust fault diagnosis in industrial processes.
☆ Secure Query Processing with Linear Complexity
We present LINQ, the first join protocol with linear complexity (in both running time and communication) under the secure multi-party computation model (MPC). It can also be extended to support all free-connex queries, a large class of select-join-aggregate queries, still with linear complexity. This matches the plaintext result for the query processing problem, as free-connex queries are the largest class of queries known to be solvable in linear time in plaintext. We have then built a query processing system based on LINQ, and the experimental results show that LINQ significantly outperforms the state of the art. For example, it can finish a query on three relations with an output size of 1 million tuples in around 100s in the LAN setting, while existing protocols that support the query cannot finish in an hour. Thus LINQ brings MPC query processing closer to practicality.
☆ Byzantine-resilient Federated Learning With Adaptivity to Data Heterogeneity
This paper deals with federated learning (FL) in the presence of malicious Byzantine attacks and data heterogeneity. A novel Robust Average Gradient Algorithm (RAGA) is proposed, which leverages the geometric median for aggregation and can freely select the round number for local updating. Different from most existing resilient approaches, which perform convergence analysis based on strongly-convex loss function or homogeneously distributed dataset, we conduct convergence analysis for not only strongly-convex but also non-convex loss function over heterogeneous dataset. According to our theoretical analysis, as long as the fraction of dataset from malicious users is less than half, RAGA can achieve convergence at rate $\mathcal{O}({1}/{T^{2/3- \delta}})$ where $T$ is the iteration number and $\delta \in (0, 2/3)$ for non-convex loss function, and at linear rate for strongly-convex loss function. Moreover, stationary point or global optimal solution is proved to obtainable as data heterogeneity vanishes. Experimental results corroborate the robustness of RAGA to Byzantine attacks and verifies the advantage of RAGA over baselines on convergence performance under various intensity of Byzantine attacks, for heterogeneous dataset.
☆ BadEdit: Backdooring large language models by model editing ICLR 2024
Mainstream backdoor attack methods typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, for the first time, we formulate backdoor injection as a lightweight knowledge editing problem, and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit necessitates only a minimal dataset for injection (15 samples). (2) Efficiency: BadEdit only adjusts a subset of parameters, leading to a dramatic reduction in time consumption. (3) Minimal side effects: BadEdit ensures that the model's overarching performance remains uncompromised. (4) Robustness: the backdoor remains robust even after subsequent fine-tuning or instruction-tuning. Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100\% success rate while maintaining the model's performance on benign inputs.
comment: ICLR 2024
☆ Local Approximation of Secrecy Capacity
This paper uses Euclidean Information Theory (EIT) to analyze the wiretap channel. We investigate a scenario of efficiently transmitting a small amount of information subject to compression rate and secrecy constraints. We transform the information-theoretic problem into a linear algebra problem and obtain the perturbed probability distributions such that secrecy is achievable. Local approximations are being used in order to obtain an estimate of the secrecy capacity by solving a generalized eigenvalue problem.
comment: Submitted to EUSIPCO 2024
☆ Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal
The rapid integration of Large Language Models (LLMs) across diverse sectors has marked a transformative era, showcasing remarkable capabilities in text generation and problem-solving tasks. However, this technological advancement is accompanied by significant risks and vulnerabilities. Despite ongoing security enhancements, attackers persistently exploit these weaknesses, casting doubts on the overall trustworthiness of LLMs. Compounding the issue, organisations are deploying LLM-integrated systems without understanding the severity of potential consequences. Existing studies by OWASP and MITRE offer a general overview of threats and vulnerabilities but lack a method for directly and succinctly analysing the risks for security practitioners, developers, and key decision-makers who are working with this novel technology. To address this gap, we propose a risk assessment process using tools like the OWASP risk rating methodology which is used for traditional systems. We conduct scenario analysis to identify potential threat agents and map the dependent system components against vulnerability factors. Through this analysis, we assess the likelihood of a cyberattack. Subsequently, we conduct a thorough impact analysis to derive a comprehensive threat matrix. We also map threats against three key stakeholder groups: developers engaged in model fine-tuning, application developers utilizing third-party APIs, and end users. The proposed threat matrix provides a holistic evaluation of LLM-related risks, enabling stakeholders to make informed decisions for effective mitigation strategies. Our outlined process serves as an actionable and comprehensive tool for security practitioners, offering insights for resource management and enhancing the overall system security.
comment: 10 pages, 1 figure, 3 tables
☆ Private Aggregate Queries to Untrusted Databases
Private information retrieval (PIR), a privacy-preserving cryptographic tool, solves a simplified version of this problem by hiding the database item that a client accesses. Most PIR protocols require the client to know the exact row index of the intended database item, which cannot support the complicated aggregation-based statistical query in a similar setting. Some works in the PIR space contain keyword searching and SQL-like queries, but most need multiple interactions between the PIR client and PIR servers. Some schemes support searching SQL-like expressive queries in a single round but fail to enable aggregate queries. These schemes are the main focus of this paper. To bridge the gap, we have built a general-purpose novel information-theoretic PIR (IT-PIR) framework that permits a user to fetch the aggregated result, hiding all sensitive sections of the complex query from the hosting PIR server in a single round of interaction. In other words, the server will not know which records contribute to the aggregation. We then evaluate the feasibility of our protocol for both benchmarking and real-world application settings. For instance, in a complex aggregate query to the Twitter microblogging database of 1 million tweets, our protocol takes 0.014 seconds for a PIR server to generate the result when the user is interested in one of 3K user handles. In contrast, for a much-simplified task, not an aggregate but a positional query, Goldberg's regular IT-PIR (Oakland 2007) takes 1.13 seconds. For all possible user handles, 300K, it takes equal time compared to the regular IT-PIR. This example shows that complicated aggregate queries through our framework do not incur additional overhead if not less, compared to the conventional query.
☆ Graph Attention Network-based Block Propagation with Optimal AoI and Reputation in Web 3.0
Web 3.0 is recognized as a pioneering paradigm that empowers users to securely oversee data without reliance on a centralized authority. Blockchains, as a core technology to realize Web 3.0, can facilitate decentralized and transparent data management. Nevertheless, the evolution of blockchain-enabled Web 3.0 is still in its nascent phase, grappling with challenges such as ensuring efficiency and reliability to enhance block propagation performance. In this paper, we design a Graph Attention Network (GAT)-based reliable block propagation optimization framework for blockchain-enabled Web 3.0. We first innovatively apply a data-freshness metric called age of information to measure block propagation efficiency in public blockchains. To achieve the reliability of block propagation, we introduce a reputation mechanism based on the subjective logic model, including the local and recommended opinions to calculate the miner reputation value. Moreover, considering that the GAT possesses the excellent ability to process graph-structured data, we utilize the GAT with reinforcement learning to obtain the optimal block propagation trajectory. Numerical results demonstrate that the proposed scheme exhibits the most outstanding block propagation efficiency and reliability compared with traditional routing algorithms.
☆ A system capable of verifiably and privately screening global DNA synthesis
Printing custom DNA sequences is essential to scientific and biomedical research, but the technology can be used to manufacture plagues as well as cures. Just as ink printers recognize and reject attempts to counterfeit money, DNA synthesizers and assemblers should deny unauthorized requests to make viral DNA that could be used to ignite a pandemic. There are three complications. First, we don't need to quickly update printers to deal with newly discovered currencies, whereas we regularly learn of new viruses and other biological threats. Second, anti-counterfeiting specifications on a local printer can't be extracted and misused by malicious actors, unlike information on biological threats. Finally, any screening must keep the inspected DNA sequences private, as they may constitute valuable trade secrets. Here we describe SecureDNA, a free, privacy-preserving, and fully automated system capable of verifiably screening all DNA synthesis orders of 30+ base pairs against an up-to-date database of hazards, and its operational performance and specificity when applied to 67 million base pairs of DNA synthesized by providers in the United States, Europe, and China.
comment: Main text 10 pages, 4 figures. 5 supplementary figures. Total 21 pages. Direct correspondence to: Ivan B. Damgard (ivan@cs.au.dk), Andrew C. Yao (andrewcyao@mail.tsinghua.edu.cn), Kevin M. Esvelt (esvelt@mit.edu)
☆ Zero-Knowledge Proof of Distinct Identity: a Standard-compatible Sybil-resistant Pseudonym Extension for C-ITS
Pseudonyms are widely used in Cooperative Intelligent Transport Systems (C-ITS) to protect the location privacy of vehicles. However, the unlinkability nature of pseudonyms also enables Sybil attacks, where a malicious vehicle can pretend to be multiple vehicles at the same time. In this paper, we propose a novel protocol called zero-knowledge Proof of Distinct Identity (zk-PoDI,) which allows a vehicle to prove that it is not the owner of another pseudonym in the local area, without revealing its actual identity. Zk-PoDI is based on the Diophantine equation and zk-SNARK, and does not rely on any specific pseudonym design or infrastructure assistance. We show that zk-PoDI satisfies all the requirements for a practical Sybil-resistance pseudonym system, and it has low latency, adjustable difficulty, moderate computation overhead, and negligible communication cost. We also discuss the future work of implementing and evaluating zk-PoDI in a realistic city-scale simulation environment.
☆ A Signal Injection Attack Against Zero Involvement Pairing and Authentication for the Internet of Things
Zero Involvement Pairing and Authentication (ZIPA) is a promising technique for autoprovisioning large networks of Internet-of-Things (IoT) devices. In this work, we present the first successful signal injection attack on a ZIPA system. Most existing ZIPA systems assume there is a negligible amount of influence from the unsecured outside space on the secured inside space. In reality, environmental signals do leak from adjacent unsecured spaces and influence the environment of the secured space. Our attack takes advantage of this fact to perform a signal injection attack on the popular Schurmann & Sigg algorithm. The keys generated by the adversary with a signal injection attack at 95 dBA is within the standard error of the legitimate device.
☆ Shortchanged: Uncovering and Analyzing Intimate Partner Financial Abuse in Consumer Complaints CHI '24
Digital financial services can introduce new digital-safety risks for users, particularly survivors of intimate partner financial abuse (IPFA). To offer improved support for such users, a comprehensive understanding of their support needs and the barriers they face to redress by financial institutions is essential. Drawing from a dataset of 2.7 million customer complaints, we implement a bespoke workflow that utilizes language-modeling techniques and expert human review to identify complaints describing IPFA. Our mixed-method analysis provides insight into the most common digital financial products involved in these attacks, and the barriers consumers report encountering when doing so. Our contributions are twofold; we offer the first human-labeled dataset for this overlooked harm and provide practical implications for technical practice, research, and design for better supporting and protecting survivors of IPFA.
comment: 20 pages, 9 figures, 8 tables, This paper will be published in CHI '24: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems
☆ Jailbreaking is Best Solved by Definition
The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing the output of undesirable responses. In this work, we critically examine the two stages of the defense pipeline: (i) the definition of what constitutes unsafe outputs, and (ii) the enforcement of the definition via methods such as input processing or fine-tuning. We cast severe doubt on the efficacy of existing enforcement mechanisms by showing that they fail to defend even for a simple definition of unsafe outputs--outputs that contain the word "purple". In contrast, post-processing outputs is perfectly robust for such a definition. Drawing on our results, we present our position that the real challenge in defending jailbreaks lies in obtaining a good definition of unsafe responses: without a good definition, no enforcement strategy can succeed, but with a good definition, output processing already serves as a robust baseline albeit with inference-time overheads.
☆ Six Levels of Privacy: A Framework for Financial Synthetic Data
Synthetic Data is increasingly important in financial applications. In addition to the benefits it provides, such as improved financial modeling and better testing procedures, it poses privacy risks as well. Such data may arise from client information, business information, or other proprietary sources that must be protected. Even though the process by which Synthetic Data is generated serves to obscure the original data to some degree, the extent to which privacy is preserved is hard to assess. Accordingly, we introduce a hierarchy of ``levels'' of privacy that are useful for categorizing Synthetic Data generation methods and the progressively improved protections they offer. While the six levels were devised in the context of financial applications, they may also be appropriate for other industries as well. Our paper includes: A brief overview of Financial Synthetic Data, how it can be used, how its value can be assessed, privacy risks, and privacy attacks. We close with details of the ``Six Levels'' that include defenses against those attacks.
comment: Six privacy levels framework; excerpted from "Synthetic Data Applications in Finance'' (arxiv:2401.00081) article
☆ Defending Against Indirect Prompt Injection Attacks With Spotlighting
Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.
♻ ☆ Contemplating Secure and Optimal Design Practices for Information Infrastructure From a Human Factors Perspective
Designing secure information infrastructure is a function of design and usability. However, security is seldom given priority when systems are being developed. Secure design practices should balance between functionality (i.e., proper design) to meet minimum requirements and user-friendliness. Design recommendations such as those with a user-centric approach (i.e., inclusive of only relevant information, user liberty) and presenting information within its proper context in a clear and engaging manner has been scientifically shown to improve user response and experience.
comment: This version is one of the final drafts and is being revised. Newer versions will be uploaded as major changes are incorporated
♻ ☆ On the Privacy Effect of Data Enhancement via the Lens of Memorization
Machine learning poses severe privacy concerns as it has been shown that the learned models can reveal sensitive information about their training data. Many works have investigated the effect of widely adopted data augmentation and adversarial training techniques, termed data enhancement in the paper, on the privacy leakage of machine learning models. Such privacy effects are often measured by membership inference attacks (MIAs), which aim to identify whether a particular example belongs to the training set or not. We propose to investigate privacy from a new perspective called memorization. Through the lens of memorization, we find that previously deployed MIAs produce misleading results as they are less likely to identify samples with higher privacy risks as members compared to samples with low privacy risks. To solve this problem, we deploy a recent attack that can capture individual samples' memorization degrees for evaluation. Through extensive experiments, we unveil several findings about the connections between three essential properties of machine learning models, including privacy, generalization gap, and adversarial robustness. We demonstrate that the generalization gap and privacy leakage are less correlated than those of the previous results. Moreover, there is not necessarily a trade-off between adversarial robustness and privacy as stronger adversarial robustness does not make the model more susceptible to privacy attacks.
comment: Accepted by IEEE TIFS, 17 pages
♻ ☆ Vulnerability analysis of captcha using Deep learning
Several websites improve their security and avoid dangerous Internet attacks by implementing CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), a type of verification to identify whether the end-user is human or a robot. The most prevalent type of CAPTCHA is text-based, designed to be easily recognized by humans while being unsolvable towards machines or robots. However, as deep learning technology progresses, development of convolutional neural network (CNN) models that predict text-based CAPTCHAs becomes easier. The purpose of this research is to investigate the flaws and vulnerabilities in the CAPTCHA generating systems in order to design more resilient CAPTCHAs. To achieve this, we created CapNet, a Convolutional Neural Network. The proposed platform can evaluate both numerical and alphanumerical CAPTCHAs
♻ ☆ LogPrécis: Unleashing Language Models for Automated Shell Log Analysis
The collection of security-related logs holds the key to understanding attack behaviors and diagnosing vulnerabilities. Still, their analysis remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question arises whether and how LMs could be also useful for security experts since their logs contain intrinsically confused and obfuscated information. In this paper, we systematically study how to benefit from the state-of-the-art in LM to automatically analyze text-like Unix shell attack logs. We present a thorough design methodology that leads to LogPr\'ecis. It receives as input raw shell sessions and automatically identifies and assigns the attacker tactic to each portion of the session, i.e., unveiling the sequence of the attacker's goals. We demonstrate LogPr\'ecis capability to support the analysis of two large datasets containing about 400,000 unique Unix shell attacks. LogPr\'ecis reduces them into about 3,000 fingerprints, each grouping sessions with the same sequence of tactics. The abstraction it provides lets the analyst better understand attacks, identify fingerprints, detect novelty, link similar attacks, and track families and mutations. Overall, LogPr\'ecis, released as open source, paves the way for better and more responsive defense against cyberattacks.
comment: 18 pages, Computer&Security (https://www.sciencedirect.com/science/article/pii/S0167404824001068), code available at https://github.com/SmartData-Polito/logprecis, models available at https://huggingface.co/SmartDataPolito
♻ ☆ Exploring the Privacy-Energy Consumption Tradeoff for Split Federated Learning
Split Federated Learning (SFL) has recently emerged as a promising distributed learning technology, leveraging the strengths of both federated and split learning. It emphasizes the advantages of rapid convergence while addressing privacy concerns. As a result, this innovation has received significant attention from both industry and academia. However, since the model is split at a specific layer, known as a cut layer, into both client-side and server-side models for the SFL, the choice of the cut layer in SFL can have a substantial impact on the energy consumption of clients and their privacy, as it influences the training burden and the output of the client-side models. In this article, we provide a comprehensive overview of the SFL process and thoroughly analyze energy consumption and privacy. This analysis considers the influence of various system parameters on the cut layer selection strategy. Additionally, we provide an illustrative example of the cut layer selection, aiming to minimize clients' risk of reconstructing the raw data at the server while sustaining energy consumption within the required energy budget, which involves trade-offs. Finally, we address open challenges in this field. These directions represent promising avenues for future research and development.
comment: 10 pages, 5 figures
♻ ☆ Contingency Analyses with Warm Starter using Probabilistic Graphical Model
Cyberthreats are an increasingly common risk to the power grid and can thwart secure grid operations. We propose to extend contingency analysis to include cyberthreat evaluations. However, unlike the traditional N-1 or N-2 contingencies, cyberthreats (e.g., MadIoT) require simulating hard-to-solve N-k (with k >> 2) contingencies in a practical amount of time. Purely physics-based power flow solvers, while being accurate, are slow and may not solve N-k contingencies in a timely manner, whereas the emerging data-driven alternatives are fast but not sufficiently generalizable, interpretable, and scalable. To address these challenges, we propose a novel conditional Gaussian Random Field-based data-driven method that performs fast and accurate evaluation of cyberthreats. It achieves speedup of contingency analysis by warm-starting simulations, i.e., improving starting points, for the physical solvers. To improve the physical interpretability and generalizability, the proposed method incorporates domain knowledge by considering the graphical nature of the grid topology. To improve scalability, the method applies physics-informed regularization that reduces model complexity. Experiments validate that simulating MadIoT-induced attacks with our warm starter becomes approximately 5x faster on a realistic 2000-bus system.
comment: arXiv admin note: substantial text overlap with arXiv:2205.03673
♻ ☆ BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets
Reinforcement learning (RL) makes an agent learn from trial-and-error experiences gathered during the interaction with the environment. Recently, offline RL has become a popular RL paradigm because it saves the interactions with environments. In offline RL, data providers share large pre-collected datasets, and others can train high-quality agents without interacting with the environments. This paradigm has demonstrated effectiveness in critical tasks like robot control, autonomous driving, etc. However, less attention is paid to investigating the security threats to the offline RL system. This paper focuses on backdoor attacks, where some perturbations are added to the data (observations) such that given normal observations, the agent takes high-rewards actions, and low-reward actions on observations injected with triggers. In this paper, we propose Baffle (Backdoor Attack for Offline Reinforcement Learning), an approach that automatically implants backdoors to RL agents by poisoning the offline RL dataset, and evaluate how different offline RL algorithms react to this attack. Our experiments conducted on four tasks and four offline RL algorithms expose a disquieting fact: none of the existing offline RL algorithms is immune to such a backdoor attack. More specifically, Baffle modifies 10\% of the datasets for four tasks (3 robotic controls and 1 autonomous driving). Agents trained on the poisoned datasets perform well in normal settings. However, when triggers are presented, the agents' performance decreases drastically by 63.2\%, 53.9\%, 64.7\%, and 47.4\% in the four tasks on average. The backdoor still persists after fine-tuning poisoned agents on clean datasets. We further show that the inserted backdoor is also hard to be detected by a popular defensive method. This paper calls attention to developing more effective protection for the open-source offline RL dataset.
comment: Accepted at IEEE S&P (Oakland) 2024
♻ ☆ Advancing Beyond Identification: Multi-bit Watermark for Large Language Models NAACL 2024
We show the viability of tackling misuses of large language models beyond the identification of machine-generated text. While existing zero-bit watermark methods focus on detection only, some malicious misuses demand tracing the adversary user for counteracting them. To address this, we propose Multi-bit Watermark via Position Allocation, embedding traceable multi-bit information during language model generation. Through allocating tokens onto different parts of the messages, we embed longer messages in high corruption settings without added latency. By independently embedding sub-units of messages, the proposed method outperforms the existing works in terms of robustness and latency. Leveraging the benefits of zero-bit watermarking, our method enables robust extraction of the watermark without any model access, embedding and extraction of long messages ($\geq$ 32-bit) without finetuning, and maintaining text quality, while allowing zero-bit detection all at the same time. Code is released here: https://github.com/bangawayoo/mb-lm-watermarking
comment: NAACL 2024 main. 9 pages and appendix
♻ ☆ Trojan Playground: A Reinforcement Learning Framework for Hardware Trojan Insertion and Detection
Current Hardware Trojan (HT) detection techniques are mostly developed based on a limited set of HT benchmarks. Existing HT benchmark circuits are generated with multiple shortcomings, i.e., i) they are heavily biased by the designers' mindset when created, and ii) they are created through a one-dimensional lens, mainly the signal activity of nets. We introduce the first automated Reinforcement Learning (RL) HT insertion and detection framework to address these shortcomings. In the HT insertion phase, an RL agent explores the circuits and finds locations best for keeping inserted HTs hidden. On the defense side, we introduce a multi-criteria RL-based HT detector that generates test vectors to discover the existence of HTs. Using the proposed framework, one can explore the HT insertion and detection design spaces to break the limitations of human mindset and benchmark issues, ultimately leading toward the next generation of innovative detectors. We demonstrate the efficacy of our framework on ISCAS-85 benchmarks, provide the attack and detection success rates, and define a methodology for comparing our techniques.
comment: This paper appears in the Journal of Supercomputing: https://doi.org/10.1007/s11227-024-05963-8
♻ ☆ A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
Large Language Models (LLMs), such as ChatGPT and Bard, have revolutionized natural language understanding and generation. They possess deep language comprehension, human-like text generation capabilities, contextual awareness, and robust problem-solving skills, making them invaluable in various domains (e.g., search engines, customer support, translation). In the meantime, LLMs have also gained traction in the security community, revealing security vulnerabilities and showcasing their potential in security-related tasks. This paper explores the intersection of LLMs with security and privacy. Specifically, we investigate how LLMs positively impact security and privacy, potential risks and threats associated with their use, and inherent vulnerabilities within LLMs. Through a comprehensive literature review, the paper categorizes the papers into "The Good" (beneficial LLM applications), "The Bad" (offensive applications), and "The Ugly" (vulnerabilities of LLMs and their defenses). We have some interesting findings. For example, LLMs have proven to enhance code security (code vulnerability detection) and data privacy (data confidentiality protection), outperforming traditional methods. However, they can also be harnessed for various attacks (particularly user-level attacks) due to their human-like reasoning abilities. We have identified areas that require further research efforts. For example, Research on model and parameter extraction attacks is limited and often theoretical, hindered by LLM parameter scale and confidentiality. Safe instruction tuning, a recent development, requires more exploration. We hope that our work can shed light on the LLMs' potential to both bolster and jeopardize cybersecurity.
Computation and Language
☆ Learning from Models and Data for Visual Grounding
We introduce SynGround, a novel framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. The knowledge transfer from the models initiates the generation of image descriptions through an image description generator. These descriptions serve dual purposes: they act as prompts for synthesizing images through a text-to-image generator, and as queries for synthesizing text, from which phrases are extracted using a large language model. Finally, we leverage an open-vocabulary object detector to generate synthetic bounding boxes for the synthetic images and texts. We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention consistency objective that aligns region annotations with gradient-based model explanations. The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model. Particularly, SynGround improves the pointing game accuracy of ALBEF on the Flickr30k dataset from 79.38% to 87.26%, and on RefCOCO+ Test A from 69.35% to 79.06% and on RefCOCO+ Test B from 53.77% to 63.67%.
comment: Project Page: https://catherine-r-he.github.io/SynGround/
☆ ZigMa: Zigzag Mamba Diffusion Model
The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$ and UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be released at https://taohu.me/zigma/
comment: Project Page: https://taohu.me/zigma/
☆ Natural Language as Polices: Reasoning for Coordinate-Level Embodied Control with LLMs
We demonstrate experimental results with LLMs that address robotics action planning problems. Recently, LLMs have been applied in robotics action planning, particularly using a code generation approach that converts complex high-level instructions into mid-level policy codes. In contrast, our approach acquires text descriptions of the task and scene objects, then formulates action planning through natural language reasoning, and outputs coordinate level control commands, thus reducing the necessity for intermediate representation code as policies. Our approach is evaluated on a multi-modal prompt simulation benchmark, demonstrating that our prompt engineering experiments with natural language reasoning significantly enhance success rates compared to its absence. Furthermore, our approach illustrates the potential for natural language descriptions to transfer robotics skills from known tasks to previously unseen tasks.
comment: 8 pages, 2 figures
☆ Reverse Training to Nurse the Reversal Curse
Large language models (LLMs) have a surprising failure: when trained on "A has a feature B", they do not generalize to "B is a feature of A", which is termed the Reversal Curse. Even when training with trillions of tokens this issue still appears due to Zipf's law - hence even if we train on the entire internet. This work proposes an alternative training scheme, called reverse training, whereby all words are used twice, doubling the amount of available tokens. The LLM is trained in both forward and reverse directions by reversing the training strings while preserving (i.e., not reversing) chosen substrings, such as entities. We show that data-matched reverse-trained models provide superior performance to standard models on standard tasks, and compute-matched reverse-trained models provide far superior performance on reversal tasks, helping resolve the reversal curse issue.
☆ Chain-of-Interaction: Enhancing Large Language Models for Psychiatric Behavior Understanding by Dyadic Contexts CHI 2024
Automatic coding patient behaviors is essential to support decision making for psychotherapists during the motivational interviewing (MI), a collaborative communication intervention approach to address psychiatric issues, such as alcohol and drug addiction. While the behavior coding task has rapidly adapted machine learning to predict patient states during the MI sessions, lacking of domain-specific knowledge and overlooking patient-therapist interactions are major challenges in developing and deploying those models in real practice. To encounter those challenges, we introduce the Chain-of-Interaction (CoI) prompting method aiming to contextualize large language models (LLMs) for psychiatric decision support by the dyadic interactions. The CoI prompting approach systematically breaks down the coding task into three key reasoning steps, extract patient engagement, learn therapist question strategies, and integrates dyadic interactions between patients and therapists. This approach enables large language models to leverage the coding scheme, patient state, and domain knowledge for patient behavioral coding. Experiments on real-world datasets can prove the effectiveness and flexibility of our prompting method with multiple state-of-the-art LLMs over existing prompting baselines. We have conducted extensive ablation analysis and demonstrate the critical role of dyadic interactions in applying LLMs for psychotherapy behavior understanding.
comment: Accepted to IEEE ICHI 2024
☆ Information-Theoretic Distillation for Reference-less Summarization
The current winning recipe for automatic summarization is using proprietary large-scale language models (LLMs) such as ChatGPT as is, or imitation learning from them as teacher models. While increasingly ubiquitous dependence on such large-scale language models is convenient, there remains an important question of whether small-scale models could have achieved competitive results, if we were to seek an alternative learning method -- that allows for a more cost-efficient, controllable, yet powerful summarizer. We present InfoSumm, a novel framework to distill a powerful summarizer based on the information-theoretic objective for summarization, without relying on either the LLM's capability or human-written references. To achieve this, we first propose a novel formulation of the desiderata of summarization (saliency, faithfulness and brevity) through the lens of mutual information between the original document and the summary. Based on this formulation, we start off from Pythia-2.8B as the teacher model, which is not yet capable of summarization, then self-train the model to optimize for the information-centric measures of ideal summaries. Distilling from the improved teacher, we arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT, without ever relying on ChatGPT's capabilities. Extensive analysis demonstrates that our approach outperforms in-domain supervised models in human evaluation, let alone state-of-the-art unsupervised methods, and wins over ChatGPT in controllable summarization.
☆ Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement
The relationship between language model tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes impact number agreement in Spanish plurals. We find that morphologically-aligned tokenization performs similarly to other tokenization schemes, even when induced artificially for words that would not be tokenized that way during training. We then present exploratory analyses demonstrating that language model embeddings for different plural tokenizations have similar distributions along the embedding space axis that maximally distinguishes singular and plural nouns. Our results suggest that morphologically-aligned tokenization is a viable tokenization approach, and existing models already generalize some morphological patterns to new items. However, our results indicate that morphological tokenization is not strictly required for performance.
☆ EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation LREC
Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassing a wide array of scripts, and are imbued with profound religious and cultural significance. This paper introduces EthioLLM -- multilingual large language models for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark -- a new benchmark dataset for various downstream NLP tasks. We evaluate the performance of these models across five downstream NLP tasks. We open-source our multilingual language models, new benchmark datasets for various downstream tasks, and task-specific fine-tuned language models and discuss the performance of the models. Our dataset and models are available at the https://huggingface.co/EthioNLP repository.
comment: Accepted at LREC-Coling 2024
☆ PARAMANU-AYN: An Efficient Novel Generative and Instruction-tuned Language Model for Indian Legal Case Documents
In this paper, we present PARAMANU-AYN, a language model based exclusively on case documents of the Supreme Court of India, the Constitution of India, and the Indian Penal Code. The novel Auto Regressive (AR) decoder based model is pretrained from scratch at a context size of 8192. We evaluated our pretrained legal model on perplexity metrics. We also instruction-tuned our pretrained model on a set of 10,763 instructions covering various legal tasks such as legal reasoning, judgement explanation, legal clause generation, legal drafting, legal contract drafting, case summarization, constitutional question-answering, etc. We also evaluated the responses of prompts for instruction-tuned models by GPT-3.5-Turbo on clarity, relevance, completeness, and legal reasoning metrics in a scale of 10. Our model can be run on CPU and achieved 42.46 tokens/sec CPU inference speed. We found that our models, despite not being pretrained on legal books, various legal contracts, and legal documents, were able to learn the domain knowledge required for drafting various legal contracts and legal clauses, and generalize to draft legal contracts and legal clauses with limited instruction tuning. Hence, we conclude that for a strong domain-specialized generative language model (such as legal), very large amounts of data are not required to develop models from scratch. We believe that this work is the first attempt to make a dedicated generative legal language model from scratch for Indian Supreme Court jurisdiction or in legal NLP overall. We plan to release our Paramanu-Ayn model at https://www.bharatgpts.com.
☆ RoleInteract: Evaluating the Social Interaction of Role-Playing Agents
Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce RoleInteract, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both individual and group levels of social interactions. The benchmark is constructed from a variety of sources and covers a wide range of 500 characters and over 6,000 question prompts and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that agents excelling in individual level does not imply their proficiency in group level. Moreover, the behavior of individuals may drift as a result of the influence exerted by other agents within the group. Experimental results on RoleInteract confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/RoleInteract.
☆ Grounding Spatial Relations in Text-Only Language Models
This paper shows that text-only Language Models (LM) can learn to ground spatial relations like "left of" or "below" if they are provided with explicit location information of objects and they are properly trained to leverage those locations. We perform experiments on a verbalized version of the Visual Spatial Reasoning (VSR) dataset, where images are coupled with textual statements which contain real or fake spatial relations between two objects of the image. We verbalize the images using an off-the-shelf object detector, adding location tokens to every object label to represent their bounding boxes in textual form. Given the small size of VSR, we do not observe any improvement when using locations, but pretraining the LM over a synthetic dataset automatically derived by us improves results significantly when using location tokens. We thus show that locations allow LMs to ground spatial relations, with our text-only LMs outperforming Vision-and-Language Models and setting the new state-of-the-art for the VSR dataset. Our analysis show that our text-only LMs can generalize beyond the relations seen in the synthetic dataset to some extent, learning also more useful information than that encoded in the spatial rules we used to create the synthetic dataset itself.
comment: Accepted in Neural Networks
☆ Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese
In this paper, we explore the utility of \textit{Translationese} as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their performance on downstream natural language understanding and generative tasks is only 3.56\% poorer on NLU tasks and 1.51\% on NLG tasks than LMs pre-trained on clean data. Further, we propose the use of lightweight \textit{TinyLMs} pre-trained on clean data to filter synthetic data efficiently which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pretraining on a tiny fraction (10\%) of clean data. We release the data we collected and created as a part of this work, \textit{IndicMonoDoc}, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.
☆ Llama meets EU: Investigating the European Political Spectrum through the Lens of LLMs NAACL 2024
Instruction-finetuned Large Language Models inherit clear political leanings that have been shown to influence downstream task performance. We expand this line of research beyond the two-party system in the US and audit Llama Chat in the context of EU politics in various settings to analyze the model's political knowledge and its ability to reason in context. We adapt, i.e., further fine-tune, Llama Chat on speeches of individual euro-parties from debates in the European Parliament to reevaluate its political leaning based on the EUandI questionnaire. Llama Chat shows considerable knowledge of national parties' positions and is capable of reasoning in context. The adapted, party-specific, models are substantially re-aligned towards respective positions which we see as a starting point for using chat-based LLMs as data-driven conversational engines to assist research in political science.
comment: accepted to NAACL 2024 as a short paper
☆ Teacher-Student Training for Debiasing: General Permutation Debiasing for Large Language Models
Large Language Models (LLMs) have demonstrated impressive zero-shot capabilities and versatility in NLP tasks, however they sometimes fail to maintain crucial invariances for specific tasks. One example is permutation sensitivity, where LLMs' outputs may significantly vary depending on the order of the input options. While debiasing techniques can mitigate these issues, and yield better performance and reliability, they often come with a high computational cost at inference. This paper addresses this inefficiency at inference time. The aim is to distill the capabilities of a computationally intensive, debiased, teacher model into a more compact student model. We explore two variants of student models: one based on pure distillation, and the other on an error-correction approach for more complex tasks, where the student corrects a single biased decision from the teacher to achieve a debiased output. Our approach is general and can be applied to both black-box and white-box LLMs. Furthermore, we demonstrate that our compact, encoder-only student models can outperform their larger, biased teacher counterparts, achieving better results with significantly fewer parameters.
☆ Genetic Auto-prompt Learning for Pre-trained Code Intelligence Language Models
As Pre-trained Language Models (PLMs), a popular approach for code intelligence, continue to grow in size, the computational cost of their usage has become prohibitively expensive. Prompt learning, a recent development in the field of natural language processing, emerges as a potential solution to address this challenge. In this paper, we investigate the effectiveness of prompt learning in code intelligence tasks. We unveil its reliance on manually designed prompts, which often require significant human effort and expertise. Moreover, we discover existing automatic prompt design methods are very limited to code intelligence tasks due to factors including gradient dependence, high computational demands, and limited applicability. To effectively address both issues, we propose Genetic Auto Prompt (GenAP), which utilizes an elaborate genetic algorithm to automatically design prompts. With GenAP, non-experts can effortlessly generate superior prompts compared to meticulously manual-designed ones. GenAP operates without the need for gradients or additional computational costs, rendering it gradient-free and cost-effective. Moreover, GenAP supports both understanding and generation types of code intelligence tasks, exhibiting great applicability. We conduct GenAP on three popular code intelligence PLMs with three canonical code intelligence tasks including defect prediction, code summarization, and code translation. The results suggest that GenAP can effectively automate the process of designing prompts. Specifically, GenAP outperforms all other methods across all three tasks (e.g., improving accuracy by an average of 2.13% for defect prediction). To the best of our knowledge, GenAP is the first work to automatically design prompts for code intelligence PLMs.
☆ CONLINE: Complex Code Generation and Refinement with Online Searching and Correctness Testing
Large Language Models (LLMs) have revolutionized code generation ability by converting natural language descriptions into executable code. However, generating complex code within real-world scenarios remains challenging due to intricate structures, subtle bugs, understanding of advanced data types, and lack of supplementary contents. To address these challenges, we introduce the CONLINE framework, which enhances code generation by incorporating planned online searches for information retrieval and automated correctness testing for iterative refinement. CONLINE also serializes the complex inputs and outputs to improve comprehension and generate test case to ensure the framework's adaptability for real-world applications. CONLINE is validated through rigorous experiments on the DS-1000 and ClassEval datasets. It shows that CONLINE substantially improves the quality of complex code generation, highlighting its potential to enhance the practicality and reliability of LLMs in generating intricate code.
☆ Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation
In this paper, we study the problem of multi-reward reinforcement learning to jointly optimize for multiple text qualities for natural language generation. We focus on the task of counselor reflection generation, where we optimize the generators to simultaneously improve the fluency, coherence, and reflection quality of generated counselor responses. We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which rely on the broad strategy of combining rewards into a single value and optimizing them simultaneously. Specifically, we employ non-contextual and contextual multi-arm bandits to dynamically adjust multiple reward weights during training. Through automatic and manual evaluations, we show that our proposed techniques, DynaOpt and C-DynaOpt, outperform existing naive and bandit baselines, showcasing their potential for enhancing language models.
☆ eRST: A Signaled Graph Theory of Discourse Relations and Organization
In this article we present Enhanced Rhetorical Structure Theory (eRST), a new theoretical framework for computational discourse analysis, based on an expansion of Rhetorical Structure Theory (RST). The framework encompasses discourse relation graphs with tree-breaking, nonprojective and concurrent relations, as well as implicit and explicit signals which give explainable rationales to our analyses. We survey shortcomings of RST and other existing frameworks, such as Segmented Discourse Representation Theory (SDRT), the Penn Discourse Treebank (PDTB) and Discourse Dependencies, and address these using constructs in the proposed theory. We provide annotation, search and visualization tools for data, and present and evaluate a freely available corpus of English annotated according to our framework, encompassing 12 spoken and written genres with over 200K tokens. Finally, we discuss automatic parsing, evaluation metrics and applications for data in our framework.
☆ What explains the success of cross-modal fine-tuning with ORCA?
ORCA (Shen et al., 2023) is a recent technique for cross-modal fine-tuning, i.e., applying pre-trained transformer models to modalities beyond their training data. The technique consists primarily of training an embedder and fine-tuning the embedder and model. Despite its high performance on a variety of downstream tasks, we do not understand precisely how each of these components contribute to ORCA's success. Therefore, we run a series of ablations and find that embedder training does not help 2D tasks at all, contrary to what the original paper posits. In 1D tasks, some amount of embedder training is necessary but more is not better. In 4 out of 6 datasets we experiment with, it is model fine-tuning that makes the biggest difference. Through our ablations and baselines, we contribute a better understanding of the individual components of ORCA.
☆ Motion Generation from Fine-grained Textual Descriptions
The task of text2motion is to generate motion sequences from given textual descriptions, where a model should explore the interactions between natural language instructions and human body movements. While most existing works are confined to coarse-grained motion descriptions (e.g., "A man squats."), fine-grained ones specifying movements of relevant body parts are barely explored. Models trained with coarse texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure in generating motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset with fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with delicate prompts. Accordingly, we design a new text2motion model, FineMotionDiffuse, which makes full use of fine-grained textual information. Our experiments show that FineMotionDiffuse trained on FineHumanML3D acquires good results in quantitative evaluation. We also find this model can better generate spatially/chronologically composite motions by learning the implicit mappings from simple descriptions to the corresponding basic motions.
☆ How Gender Interacts with Political Values: A Case Study on Czech BERT Models LREC
Neural language models, which reach state-of-the-art results on most natural language processing tasks, are trained on large text corpora that inevitably contain value-burdened content and often capture undesirable biases, which the models reflect. This case study focuses on the political biases of pre-trained encoders in Czech and compares them with a representative value survey. Because Czech is a gendered language, we also measure how the grammatical gender coincides with responses to men and women in the survey. We introduce a novel method for measuring the model's perceived political values. We find that the models do not assign statement probability following value-driven reasoning, and there is no systematic difference between feminine and masculine sentences. We conclude that BERT-sized models do not manifest systematic alignment with political values and that the biases observed in the models are rather due to superficial imitation of training data patterns than systematic value beliefs encoded in the models.
comment: 11 pages, 2 figures; LREC-COLING 2024
☆ What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models
This paper presents a way of enhancing the reliability of Large Multimodal Models (LMMs) in addressing hallucination effects, where models generate incorrect or unrelated responses. Without additional instruction tuning paradigm, we introduce Counterfactual Inception, a novel method that implants counterfactual thoughts into LMMs using carefully chosen, misaligned counterfactual keywords. This method is grounded in the concept of counterfactual thinking, a cognitive process where humans consider alternative realities and outcomes. By applying this human-like reasoning mechanism to LMMs, we aim to reduce hallucination effects and improve the models' trustworthiness. We also propose Dual-modality Verification Process (DVP), a rigorous framework for selecting optimal counterfactual keywords to trigger counterfactual thinking into LMMs, concurrently considering visual and linguistic context. Our extensive experiments across various LMMs, including both open-source and proprietary models, corroborate that our method significantly mitigates hallucination phenomena across different datasets.
comment: under review, code available: https://github.com/IVY-LVLM/Counterfactual-Inception
☆ An Entropy-based Text Watermarking Detection Method
Currently, text watermarking algorithms for large language models (LLMs) can embed hidden features to texts generated by LLMs to facilitate subsequent detection, thus alleviating the problem of misuse of LLMs. Although the current text watermarking algorithms perform well in most high-entropy scenarios, its performance in low-entropy scenarios still needs to be improved. In this work, we proposed that the influence of token entropy should be fully considered in the watermark detection process, that is, the weight of each token should be adjusted according to its entropy during watermark detection, rather than setting the weight of all tokens to the same value as in previous methods. Specifically, we proposed an Entropy-based Watermark Detection (EWD) that gives higher-entropy tokens higher weights during watermark detection, so as to better reflect the degree of watermarking. Furthermore, the proposed detection process is training-free and fully automated. %In actual detection, we use a proxy-LLM to calculate the entropy of each token, without the need to use the original LLM. In the experiment, we found that our method can achieve better detection performance in low-entropy scenarios, and our method is also general and can be applied to texts with different entropy distributions. Our code and data will be available online.
comment: 8 pages, 5 figures, submitted to ARR Feb 2024
☆ HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, \emph{e.g.}, LLaVA, transforms visual features into text-like tokens using a \emph{static} vision-language mapper, thereby enabling \emph{static} LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the \emph{static} tuning strategy~\footnote{The static tuning refers to the trained model with static parameters.} that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generates adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. ~\footnote{Our project is available on the link https://github.com/DCDmllm/HyperLLaVA}.
☆ Agent Group Chat: An Interactive Group Chat Simulacra For Better Eliciting Collective Emergent Behavior
To investigate the role of language in human collective behaviors, we developed the Agent Group Chat simulation to simulate linguistic interactions among multi-agent in different settings. Agents are asked to free chat in this simulation for their own purposes based on their character setting, aiming to see agents exhibit emergent behaviours that are both unforeseen and significant. Four narrative scenarios, Inheritance Disputes, Law Court Debates, Philosophical Discourses, Movie Casting Contention, are integrated into Agent Group Chat to evaluate its support for diverse storylines. By configuring specific environmental settings within Agent Group Chat, we are able to assess whether agents exhibit behaviors that align with human expectations. We evaluate the disorder within the environment by computing the n-gram Shannon entropy of all the content speak by characters. Our findings reveal that under the premise of agents possessing substantial alignment with human expectations, facilitating more extensive information exchange within the simulation ensures greater orderliness amidst diversity, which leads to the emergence of more unexpected and meaningful emergent behaviors. The code is open source in https://github.com/MikeGu721/AgentGroup, and online platform will be open soon.
☆ LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks. However, it requires non-trivial efforts to implement these methods on different models. We present LlamaFactory, a unified framework that integrates a suite of cutting-edge efficient training methods. It allows users to flexibly customize the fine-tuning of 100+ LLMs without the need for coding through the built-in web UI LlamaBoard. We empirically validate the efficiency and effectiveness of our framework on language modeling and text generation tasks. It has been released at https://github.com/hiyouga/LLaMA-Factory and already received over 13,000 stars and 1,600 forks.
comment: 12 pages, preprint
☆ Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting
Automatic extraction of medical information from clinical documents poses several challenges: high costs of required clinical expertise, limited interpretability of model predictions, restricted computational resources and privacy regulations. Recent advances in domain-adaptation and prompting methods showed promising results with minimal training data using lightweight masked language models, which are suited for well-established interpretability methods. We are first to present a systematic evaluation of these methods in a low-resource setting, by performing multi-class section classification on German doctor's letters. We conduct extensive class-wise evaluations supported by Shapley values, to validate the quality of our small training data set and to ensure the interpretability of model predictions. We demonstrate that a lightweight, domain-adapted pretrained model, prompted with just 20 shots, outperforms a traditional classification model by 30.5% accuracy. Our results serve as a process-oriented guideline for clinical information extraction projects working with low-resource.
☆ Computational Models to Study Language Processing in the Human Brain: A Survey
Despite differing from the human language processing mechanism in implementation and algorithms, current language models demonstrate remarkable human-like or surpassing language capabilities. Should computational language models be employed in studying the brain, and if so, when and how? To delve into this topic, this paper reviews efforts in using computational models for brain research, highlighting emerging trends. To ensure a fair comparison, the paper evaluates various computational models using consistent metrics on the same dataset. Our analysis reveals that no single model outperforms others on all datasets, underscoring the need for rich testing datasets and rigid experimental control to draw robust conclusions in studies involving computational models.
☆ Incentivizing News Consumption on Social Media Platforms Using Large Language Models and Realistic Bot Accounts
Polarization, declining trust, and wavering support for democratic norms are pressing threats to U.S. democracy. Exposure to verified and quality news may lower individual susceptibility to these threats and make citizens more resilient to misinformation, populism, and hyperpartisan rhetoric. This project examines how to enhance users' exposure to and engagement with verified and ideologically balanced news in an ecologically valid setting. We rely on a large-scale two-week long field experiment (from 1/19/2023 to 2/3/2023) on 28,457 Twitter users. We created 28 bots utilizing GPT-2 that replied to users tweeting about sports, entertainment, or lifestyle with a contextual reply containing two hardcoded elements: a URL to the topic-relevant section of quality news organization and an encouragement to follow its Twitter account. To further test differential effects by gender of the bots, treated users were randomly assigned to receive responses by bots presented as female or male. We examine whether our over-time intervention enhances the following of news media organization, the sharing and the liking of news content and the tweeting about politics and the liking of political content. We find that the treated users followed more news accounts and the users in the female bot treatment were more likely to like news content than the control. Most of these results, however, were small in magnitude and confined to the already politically interested Twitter users, as indicated by their pre-treatment tweeting about politics. These findings have implications for social media and news organizations, and also offer direction for future work on how Large Language Models and other computational interventions can effectively enhance individual on-platform engagement with quality news and public affairs.
☆ USE: Dynamic User Modeling with Stateful Sequence Models
User embeddings play a crucial role in user engagement forecasting and personalized services. Recent advances in sequence modeling have sparked interest in learning user embeddings from behavioral data. Yet behavior-based user embedding learning faces the unique challenge of dynamic user modeling. As users continuously interact with the apps, user embeddings should be periodically updated to account for users' recent and long-term behavior patterns. Existing methods highly rely on stateless sequence models that lack memory of historical behavior. They have to either discard historical data and use only the most recent data or reprocess the old and new data jointly. Both cases incur substantial computational overhead. To address this limitation, we introduce User Stateful Embedding (USE). USE generates user embeddings and reflects users' evolving behaviors without the need for exhaustive reprocessing by storing previous model states and revisiting them in the future. Furthermore, we introduce a novel training objective named future W-behavior prediction to transcend the limitations of next-token prediction by forecasting a broader horizon of upcoming user behaviors. By combining it with the Same User Prediction, a contrastive learning-based objective that predicts whether different segments of behavior sequences belong to the same user, we further improve the embeddings' distinctiveness and representativeness. We conducted experiments on 8 downstream tasks using Snapchat users' behavioral logs in both static (i.e., fixed user behavior sequences) and dynamic (i.e., periodically updated user behavior sequences) settings. We demonstrate USE's superior performance over established baselines. The results underscore USE's effectiveness and efficiency in integrating historical and recent user behavior sequences into user embeddings in dynamic user modeling.
☆ Hyacinth6B: A large language model for Traditional Chinese
This research's primary motivation of this study is to address the high hardware and computational demands typically associated with LLMs.Therefore,our goal is to find a balance between model lightness and performance,striving to maximize performance while using a comparatively lightweight model. Hyacinth6B was developed with this objective in mind,aiming to fully leverage the core capabilities of LLMs without incurring substantial resource costs, effectively pushing the boundaries of smaller model's performance. The training approach involves parameter efficient finetuning using the LoRA method.
comment: 14pages
☆ Polaris: A Safety-focused LLM Constellation Architecture for Healthcare
We develop Polaris, the first safety-focused LLM constellation for real-time patient-AI healthcare conversations. Unlike prior LLM works in healthcare focusing on tasks like question answering, our work specifically focuses on long multi-turn voice conversations. Our one-trillion parameter constellation system is composed of several multibillion parameter LLMs as co-operative agents: a stateful primary agent that focuses on driving an engaging conversation and several specialist support agents focused on healthcare tasks performed by nurses to increase safety and reduce hallucinations. We develop a sophisticated training protocol for iterative co-training of the agents that optimize for diverse objectives. We train our models on proprietary data, clinical care plans, healthcare regulatory documents, medical manuals, and other medical reasoning documents. We align our models to speak like medical professionals, using organic healthcare conversations and simulated ones between patient actors and experienced nurses. This allows our system to express unique capabilities such as rapport building, trust building, empathy and bedside manner. Finally, we present the first comprehensive clinician evaluation of an LLM system for healthcare. We recruited over 1100 U.S. licensed nurses and over 130 U.S. licensed physicians to perform end-to-end conversational evaluations of our system by posing as patients and rating the system on several measures. We demonstrate Polaris performs on par with human nurses on aggregate across dimensions such as medical safety, clinical readiness, conversational quality, and bedside manner. Additionally, we conduct a challenging task-based evaluation of the individual specialist support agents, where we demonstrate our LLM agents significantly outperform a much larger general-purpose LLM (GPT-4) as well as from its own medium-size class (LLaMA-2 70B).
☆ LeanReasoner: Boosting Complex Logical Reasoning with Lean NAACL 2024
Large language models (LLMs) often struggle with complex logical reasoning due to logical inconsistencies and the inherent difficulty of such reasoning. We use Lean, a theorem proving framework, to address these challenges. By formalizing logical reasoning problems into theorems within Lean, we can solve them by proving or disproving the corresponding theorems. This method reduces the risk of logical inconsistencies with the help of Lean's symbolic solver. It also enhances our ability to treat complex reasoning tasks by using Lean's extensive library of theorem proofs. Our method achieves state-of-the-art performance on the FOLIO dataset and achieves performance near this level on ProofWriter. Notably, these results were accomplished by fine-tuning on fewer than 100 in-domain samples for each dataset.
comment: Accepted to NAACL 2024 main conference
☆ Reading Users' Minds from What They Say: An Investigation into LLM-based Empathic Mental Inference
In human-centered design, developing a comprehensive and in-depth understanding of user experiences, i.e., empathic understanding, is paramount for designing products that truly meet human needs. Nevertheless, accurately comprehending the real underlying mental states of a large human population remains a significant challenge today. This difficulty mainly arises from the trade-off between depth and scale of user experience research: gaining in-depth insights from a small group of users does not easily scale to a larger population, and vice versa. This paper investigates the use of Large Language Models (LLMs) for performing mental inference tasks, specifically inferring users' underlying goals and fundamental psychological needs (FPNs). Baseline and benchmark datasets were collected from human users and designers to develop an empathic accuracy metric for measuring the mental inference performance of LLMs. The empathic accuracy of inferring goals and FPNs of different LLMs with varied zero-shot prompt engineering techniques are experimented against that of human designers. Experimental results suggest that LLMs can infer and understand the underlying goals and FPNs of users with performance comparable to that of human designers, suggesting a promising avenue for enhancing the scalability of empathic design approaches through the integration of advanced artificial intelligence technologies. This work has the potential to significantly augment the toolkit available to designers during human-centered design, enabling the development of both large-scale and in-depth understanding of users' experiences.
comment: Submitted to IDETC-CIE2024
☆ Community Needs and Assets: A Computational Analysis of Community Conversations
A community needs assessment is a tool used by non-profits and government agencies to quantify the strengths and issues of a community, allowing them to allocate their resources better. Such approaches are transitioning towards leveraging social media conversations to analyze the needs of communities and the assets already present within them. However, manual analysis of exponentially increasing social media conversations is challenging. There is a gap in the present literature in computationally analyzing how community members discuss the strengths and needs of the community. To address this gap, we introduce the task of identifying, extracting, and categorizing community needs and assets from conversational data using sophisticated natural language processing methods. To facilitate this task, we introduce the first dataset about community needs and assets consisting of 3,511 conversations from Reddit, annotated using crowdsourced workers. Using this dataset, we evaluate an utterance-level classification model compared to sentiment classification and a popular large language model (in a zero-shot setting), where we find that our model outperforms both baselines at an F1 score of 94% compared to 49% and 61% respectively. Furthermore, we observe through our study that conversations about needs have negative sentiments and emotions, while conversations about assets focus on location and entities. The dataset is available at https://github.com/towhidabsar/CommunityNeeds.
☆ AFLoRA: Adaptive Freezing of Low Rank Adaptation in Parameter Efficient Fine-Tuning of Large Models
We present a novel Parameter-Efficient Fine-Tuning (PEFT) method, dubbed as Adaptive Freezing of Low Rank Adaptation (AFLoRA). Specifically, for each pre-trained frozen weight tensor, we add a parallel path of trainable low-rank matrices, namely a down-projection and an up-projection matrix, each of which is followed by a feature transformation vector. Based on a novel freezing score, we the incrementally freeze these projection matrices during fine-tuning to reduce the computation and alleviate over-fitting. Our experimental results demonstrate that we can achieve state-of-the-art performance with an average improvement of up to $0.85\%$ as evaluated on GLUE benchmark while yeilding up to $9.5\times$ fewer average trainable parameters. While compared in terms of runtime, AFLoRA can yield up to $1.86\times$ improvement as opposed to similar PEFT alternatives. Besides the practical utility of our approach, we provide insights on the trainability requirements of LoRA paths at different modules and the freezing schedule for the different projection matrices. Code will be released.
comment: 5 pages, 5 figures
☆ Arcee's MergeKit: A Toolkit for Merging Large Language Models
The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pre-trained models for specific tasks, has resulted in the development of vast amounts of task-specific models, typically specialized in individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI - including the difficulties of catastrophic forgetting and multi-task learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the worlds most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.
comment: 11 pages, 4 figures
☆ Document Author Classification Using Parsed Language Structure
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of \emph{The Federalist Papers}. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted using a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of "proof texts," The Federalist Papers and Sanditon which have been as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of some depth, part of speech, and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
☆ Facilitating Pornographic Text Detection for Open-Domain Dialogue Systems via Knowledge Distillation of Large Language Models SC
Pornographic content occurring in human-machine interaction dialogues can cause severe side effects for users in open-domain dialogue systems. However, research on detecting pornographic language within human-machine interaction dialogues is an important subject that is rarely studied. To advance in this direction, we introduce CensorChat, a dialogue monitoring dataset aimed at detecting whether the dialogue session contains pornographic content. To this end, we collect real-life human-machine interaction dialogues in the wild and break them down into single utterances and single-turn dialogues, with the last utterance spoken by the chatbot. We propose utilizing knowledge distillation of large language models to annotate the dataset. Specifically, first, the raw dataset is annotated by four open-source large language models, with the majority vote determining the label. Second, we use ChatGPT to update the empty label from the first step. Third, to ensure the quality of the validation and test sets, we utilize GPT-4 for label calibration. If the current label does not match the one generated by GPT-4, we employ a self-criticism strategy to verify its correctness. Finally, to facilitate the detection of pornographic text, we develop a series of text classifiers using a pseudo-labeled dataset. Detailed data analysis demonstrates that leveraging knowledge distillation techniques with large language models provides a practical and cost-efficient method for developing pornographic text detectors.
comment: Accepted to CSCWD 2024 (27th International Conference on Computer Supported Cooperative Work in Design). arXiv admin note: text overlap with arXiv:2309.09749
☆ Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model
While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge. Here, we introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the 'teachers'. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers', enabling it to generate novel molecules that conform to the descriptions through various text prompts. We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratio of 88.08%, 65.27%, and 61.44%, respectively. The model also exhibits adaptability through zero-shot testing, creating molecules that satisfy combinations of properties that have not been encountered. It can comprehend text inputs with various language styles, extending beyond the confines of outlined prompts, as confirmed through empirical validation. Additionally, the knowledge distillation feature of TSMMG contributes to the continuous enhancement of small models, while the innovative approach to dataset construction effectively addresses the issues of data scarcity and quality, which positions TSMMG as a promising tool in the domains of drug discovery and materials science. Code is available at https://github.com/HHW-zhou/TSMMG.
comment: 25 pages, 4 figures
☆ SumTra: A Differentiable Pipeline for Few-Shot Cross-Lingual Summarization NAACL 2024
Cross-lingual summarization (XLS) generates summaries in a language different from that of the input documents (e.g., English to Spanish), allowing speakers of the target language to gain a concise view of their content. In the present day, the predominant approach to this task is to take a performing, pretrained multilingual language model (LM) and fine-tune it for XLS on the language pairs of interest. However, the scarcity of fine-tuning samples makes this approach challenging in some cases. For this reason, in this paper we propose revisiting the summarize-and-translate pipeline, where the summarization and translation tasks are performed in a sequence. This approach allows reusing the many, publicly-available resources for monolingual summarization and translation, obtaining a very competitive zero-shot performance. In addition, the proposed pipeline is completely differentiable end-to-end, allowing it to take advantage of few-shot fine-tuning, where available. Experiments over two contemporary and widely adopted XLS datasets (CrossSum and WikiLingua) have shown the remarkable zero-shot performance of the proposed approach, and also its strong few-shot performance compared to an equivalent multilingual LM baseline, that the proposed approach has been able to outperform in many languages with only 10% of the fine-tuning samples.
comment: Accepted to NAACL 2024
☆ Technical Report: Competition Solution For BetterMixture
In the era of flourishing large-scale models, the challenge of selecting and optimizing datasets from the vast and complex sea of data, to enhance the performance of large language models within the constraints of limited computational resources, has become paramount. This paper details our solution for the BetterMixture challenge, which focuses on the fine-tuning data mixing for large language models. Our approach, which secured third place, incorporates data deduplication, low-level and high-level quality filtering, and diversity selection. The foundation of our solution is Ke-Data-Juicer, an extension of Data-Juicer, demonstrating its robust capabilities in handling and optimizing data for large language models.
comment: 6 pages
☆ From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards ACL 2024
Recent progress in large language models (LLMs) has led to their widespread adoption in various domains. However, these advancements have also introduced additional safety risks and raised concerns regarding their detrimental impact on already marginalized populations. Despite growing mitigation efforts to develop safety safeguards, such as supervised safety-oriented fine-tuning and leveraging safe reinforcement learning from human feedback, multiple concerns regarding the safety and ingrained biases in these models remain. Furthermore, previous work has demonstrated that models optimized for safety often display exaggerated safety behaviors, such as a tendency to refrain from responding to certain requests as a precautionary measure. As such, a clear trade-off between the helpfulness and safety of these models has been documented in the literature. In this paper, we further investigate the effectiveness of safety measures by evaluating models on already mitigated biases. Using the case of Llama 2 as an example, we illustrate how LLMs' safety responses can still encode harmful assumptions. To do so, we create a set of non-toxic prompts, which we then use to evaluate Llama models. Through our new taxonomy of LLMs responses to users, we observe that the safety/helpfulness trade-offs are more pronounced for certain demographic groups which can lead to quality-of-service harms for marginalized populations.
comment: 9 pages, 4 figures, submitted to the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
☆ Ax-to-Grind Urdu: Benchmark Dataset for Urdu Fake News Detection
Misinformation can seriously impact society, affecting anything from public opinion to institutional confidence and the political horizon of a state. Fake News (FN) proliferation on online websites and Online Social Networks (OSNs) has increased profusely. Various fact-checking websites include news in English and barely provide information about FN in regional languages. Thus the Urdu FN purveyors cannot be discerned using factchecking portals. SOTA approaches for Fake News Detection (FND) count upon appropriately labelled and large datasets. FND in regional and resource-constrained languages lags due to the lack of limited-sized datasets and legitimate lexical resources. The previous datasets for Urdu FND are limited-sized, domain-restricted, publicly unavailable and not manually verified where the news is translated from English into Urdu. In this paper, we curate and contribute the first largest publicly available dataset for Urdu FND, Ax-to-Grind Urdu, to bridge the identified gaps and limitations of existing Urdu datasets in the literature. It constitutes 10,083 fake and real news on fifteen domains collected from leading and authentic Urdu newspapers and news channel websites in Pakistan and India. FN for the Ax-to-Grind dataset is collected from websites and crowdsourcing. The dataset contains news items in Urdu from the year 2017 to the year 2023. Expert journalists annotated the dataset. We benchmark the dataset with an ensemble model of mBERT,XLNet, and XLM RoBERTa. The selected models are originally trained on multilingual large corpora. The results of the proposed model are based on performance metrics, F1-score, accuracy, precision, recall and MCC value.
☆ A New Massive Multilingual Dataset for High-Performance Language Technologies LREC
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
comment: LREC-COLING 2024
☆ On Prompt Sensitivity of ChatGPT in Affective Computing
Recent studies have demonstrated the emerging capabilities of foundation models like ChatGPT in several fields, including affective computing. However, accessing these emerging capabilities is facilitated through prompt engineering. Despite the existence of some prompting techniques, the field is still rapidly evolving and many prompting ideas still require investigation. In this work, we introduce a method to evaluate and investigate the sensitivity of the performance of foundation models based on different prompts or generation parameters. We perform our evaluation on ChatGPT within the scope of affective computing on three major problems, namely sentiment analysis, toxicity detection, and sarcasm detection. First, we carry out a sensitivity analysis on pivotal parameters in auto-regressive text generation, specifically the temperature parameter $T$ and the top-$p$ parameter in Nucleus sampling, dictating how conservative or creative the model should be during generation. Furthermore, we explore the efficacy of several prompting ideas, where we explore how giving different incentives or structures affect the performance. Our evaluation takes into consideration performance measures on the affective computing tasks, and the effectiveness of the model to follow the stated instructions, hence generating easy-to-parse responses to be smoothly used in downstream applications.
comment: 2 Tables, 1 Figure, preprint submission to ACII 2024
☆ Evaluating Unsupervised Dimensionality Reduction Methods for Pretrained Sentence Embeddings
Sentence embeddings produced by Pretrained Language Models (PLMs) have received wide attention from the NLP community due to their superior performance when representing texts in numerous downstream applications. However, the high dimensionality of the sentence embeddings produced by PLMs is problematic when representing large numbers of sentences in memory- or compute-constrained devices. As a solution, we evaluate unsupervised dimensionality reduction methods to reduce the dimensionality of sentence embeddings produced by PLMs. Our experimental results show that simple methods such as Principal Component Analysis (PCA) can reduce the dimensionality of sentence embeddings by almost $50\%$, without incurring a significant loss in performance in multiple downstream tasks. Surprisingly, reducing the dimensionality further improves performance over the original high-dimensional versions for the sentence embeddings produced by some PLMs in some tasks.
☆ Reducing Large Language Model Bias with Emphasis on 'Restricted Industries': Automated Dataset Augmentation and Prejudice Quantification
Despite the growing capabilities of large language models, there exists concerns about the biases they develop. In this paper, we propose a novel, automated mechanism for debiasing through specified dataset augmentation in the lens of bias producers and in the context of 'restricted industries' with limited data. We additionally create two new additional metrics, the mb-index and db-index, to quantify bias, considering the idea that bias occurs due to both intrinsic model architecture and dataset.
☆ Visually Grounded Speech Models have a Mutual Exclusivity Bias ACL
When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: a novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete word representations as input, ignoring the high variability of spoken words. We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio. Concretely, we train a model on familiar words and test its ME bias by asking it to select between a novel and a familiar object when queried with a novel word. To simulate prior acoustic and visual knowledge, we experiment with several initialisation strategies using pretrained speech and vision networks. Our findings reveal the ME bias across the different initialisation approaches, with a stronger bias in models with more prior (in particular, visual) knowledge. Additional tests confirm the robustness of our results, even when different loss functions are considered.
comment: Accepted to TACL, pre-MIT Press publication version
☆ Leveraging Linguistically Enhanced Embeddings for Open Information Extraction LREC
Open Information Extraction (OIE) is a structured prediction (SP) task in Natural Language Processing (NLP) that aims to extract structured $n$-ary tuples - usually subject-relation-object triples - from free text. The word embeddings in the input text can be enhanced with linguistic features, usually Part-of-Speech (PoS) and Syntactic Dependency Parse (SynDP) labels. However, past enhancement techniques cannot leverage the power of pretrained language models (PLMs), which themselves have been hardly used for OIE. To bridge this gap, we are the first to leverage linguistic features with a Seq2Seq PLM for OIE. We do so by introducing two methods - Weighted Addition and Linearized Concatenation. Our work can give any neural OIE architecture the key performance boost from both PLMs and linguistic features in one go. In our settings, this shows wide improvements of up to 24.9%, 27.3% and 14.9% on Precision, Recall and F1 scores respectively over the baseline. Beyond this, we address other important challenges in the field: to reduce compute overheads with the features, we are the first ones to exploit Semantic Dependency Parse (SemDP) tags; to address flaws in current datasets, we create a clean synthetic dataset; finally, we contribute the first known study of OIE behaviour in SP models.
comment: Accepted at LREC-COLING 2024 Main Conference, Long Paper
☆ Train & Constrain: Phonologically Informed Tongue-Twister Generation from Topics and Paraphrases
Previous work in phonologically and phonetically grounded language generation has mainly focused on domains such as puns and poetry. In this article, we present new work on the generation of tongue-twisters - a form of language that is required to be conditioned on a phoneme level to maximize sound overlap, whilst maintaining semantic consistency with an input topic and still being grammatically correct. We present TwisterLister, a pipeline for generating phonologically informed tongue-twisters from Large Language Models (LLMs) that we use to generate TwistList 2.0, the largest annotated dataset of tongue-twisters to date, consisting of 17K+ examples from a combination of human and LLM authors. Our generation pipeline involves the use of a phonologically constrained vocabulary alongside LLM prompting to generate novel, non-derivative tongue-twister examples. We additionally present the results of automatic and human evaluation of smaller models trained on our generated dataset to demonstrate the extent to which phonologically motivated language types can be generated without explicit injection of phonological knowledge. Additionally, we introduce a Phoneme-Aware Constrained Decoding module (PACD) that can be integrated into any causal language model and demonstrate that this method generates good quality tongue-twisters both with and without fine-tuning the underlying language model. We also design and implement a range of automatic metrics for the task of tongue-twister generation that is phonologically motivated and captures the unique essence of tongue-twisters based on Phonemic Edit Distance (PED).
comment: Submitted to Computational Linguistics
☆ Multi-Modal Hallucination Control by Visual Information Grounding
Generative Vision-Language Models (VLMs) are prone to generate plausible-sounding textual answers that, however, are not always grounded in the input image. We investigate this phenomenon, usually referred to as "hallucination" and show that it stems from an excessive reliance on the language prior. In particular, we show that as more tokens are generated, the reliance on the visual prompt decreases, and this behavior strongly correlates with the emergence of hallucinations. To reduce hallucinations, we introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification. M3ID amplifies the influence of the reference image over the language prior, hence favoring the generation of tokens with higher mutual information with the visual prompt. M3ID can be applied to any pre-trained autoregressive VLM at inference time without necessitating further training and with minimal computational overhead. If training is an option, we show that M3ID can be paired with Direct Preference Optimization (DPO) to improve the model's reliance on the prompt image without requiring any labels. Our empirical findings show that our algorithms maintain the fluency and linguistic capabilities of pre-trained VLMs while reducing hallucinations by mitigating visually ungrounded answers. Specifically, for the LLaVA 13B model, M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve the accuracy on VQA benchmarks such as POPE by 21% and 24%.
♻ ☆ The Expressive Power of Transformers with Chain of Thought ICLR
Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming a slight generalization to standard pre-norm, adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, our results provide a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
comment: 9-page preprint. Updated March 20 after ICLR acceptance
♻ ☆ m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and Github (https://github.com/RAIVNLab/mnms).
♻ ☆ Having Beer after Prayer? Measuring Cultural Bias in Large Language Models
As the reach of large language models (LMs) expands globally, their ability to cater to diverse cultural contexts becomes crucial. Despite advancements in multilingual capabilities, models are not designed with appropriate cultural nuances. In this paper, we show that multilingual and Arabic monolingual LMs exhibit bias towards entities associated with Western culture. We introduce CAMeL, a novel resource of 628 naturally-occurring prompts and 20,368 entities spanning eight types that contrast Arab and Western cultures. CAMeL provides a foundation for measuring cultural biases in LMs through both extrinsic and intrinsic evaluations. Using CAMeL, we examine the cross-cultural performance in Arabic of 16 different LMs on tasks such as story generation, NER, and sentiment analysis, where we find concerning cases of stereotyping and cultural unfairness. We further test their text-infilling performance, revealing the incapability of appropriate adaptation to Arab cultural contexts. Finally, we analyze 6 Arabic pre-training corpora and find that commonly used sources such as Wikipedia may not be best suited to build culturally aware LMs, if used as they are without adjustment. We will make CAMeL publicly available at: https://github.com/tareknaous/camel
♻ ☆ HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models LREC
Large language models (LLMs) trained on massive corpora demonstrate impressive capabilities in a wide range of tasks. While there are ongoing efforts to adapt these models to languages beyond English, the attention given to their evaluation methodologies remains limited. Current multilingual benchmarks often rely on back translations or re-implementations of English tests, limiting their capacity to capture unique cultural and linguistic nuances. To bridge this gap for the Korean language, we introduce the HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth. The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension. Unlike traditional evaluation suites focused on token and sequence classification or mathematical and logical reasoning, the HAE-RAE Bench emphasizes a model's aptitude for recalling Korean-specific knowledge and cultural contexts. Comparative analysis with prior Korean benchmarks indicates that the HAE-RAE Bench presents a greater challenge to non-Korean models by disturbing abilities and knowledge learned from English being transferred.
comment: Accepted at LREC-COLING 2024
♻ ☆ Enhancing Phrase Representation by Information Bottleneck Guided Text Diffusion Process for Keyphrase Extraction LREC
Keyphrase extraction (KPE) is an important task in Natural Language Processing for many scenarios, which aims to extract keyphrases that are present in a given document. Many existing supervised methods treat KPE as sequential labeling, span-level classification, or generative tasks. However, these methods lack the ability to utilize keyphrase information, which may result in biased results. In this study, we propose Diff-KPE, which leverages the supervised Variational Information Bottleneck (VIB) to guide the text diffusion process for generating enhanced keyphrase representations. Diff-KPE first generates the desired keyphrase embeddings conditioned on the entire document and then injects the generated keyphrase embeddings into each phrase representation. A ranking network and VIB are then optimized together with rank loss and classification loss, respectively. This design of Diff-KPE allows us to rank each candidate phrase by utilizing both the information of keyphrases and the document. Experiments show that Diff-KPE outperforms existing KPE methods on a large open domain keyphrase extraction benchmark, OpenKP, and a scientific domain dataset, KP20K.
comment: 10 pages, 2 figures, accepted to LREC-COLING 2024
♻ ☆ AutoMix: Automatically Mixing Language Models
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present AutoMix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM. Central to AutoMix is a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring training. Given that verifications can be noisy, we employ a meta-verifier in AutoMix to refine the accuracy of these assessments. Our experiments using LLAMA2-13B and GPT-4, on five context-grounded reasoning datasets demonstrate that AutoMix surpasses established baselines, improving the incremental benefit per cost by up to 86%. Our code and data are available at https://github.com/automix-llm/automix.
comment: The first two authors contributed equally. Work started and partly done during Aman's internship at Google. This version adds results on additional models and datasets
♻ ☆ MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning ICLR2024
Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address the limitation above by 1) introducing vision-language Model with Multi-Modal In-Context Learning(MMICL), a new approach to allow the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially for complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and emerges the impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue for VLMs that often leads to hallucination when faced with extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC
comment: Accepted by ICLR2024
♻ ☆ Correct Like Humans: Progressive Learning Framework for Chinese Text Error Correction
Chinese Text Error Correction (CTEC) aims to detect and correct errors in the input text, which benefits human daily life and various downstream tasks. Recent approaches mainly employ Pre-trained Language Models (PLMs) to resolve CTEC. Although PLMs have achieved remarkable success in CTEC, we argue that previous studies still overlook the importance of human thinking patterns. To enhance the development of PLMs for CTEC, inspired by humans' daily error-correcting behavior, we propose a novel model-agnostic progressive learning framework, named ProTEC, which guides PLMs-based CTEC models to learn to correct like humans. During the training process, ProTEC guides the model to learn text error correction by incorporating these sub-tasks into a progressive paradigm. During the inference process, the model completes these sub-tasks in turn to generate the correction results. Extensive experiments and detailed analyses demonstrate the effectiveness and efficiency of our proposed model-agnostic ProTEC framework.
♻ ☆ Weight-Inherited Distillation for Task-Agnostic BERT Compression NAACL2024
Knowledge Distillation (KD) is a predominant approach for BERT compression. Previous KD-based methods focus on designing extra alignment losses for the student model to mimic the behavior of the teacher model. These methods transfer the knowledge in an indirect way. In this paper, we propose a novel Weight-Inherited Distillation (WID), which directly transfers knowledge from the teacher. WID does not require any additional alignment loss and trains a compact student by inheriting the weights, showing a new perspective of knowledge distillation. Specifically, we design the row compactors and column compactors as mappings and then compress the weights via structural re-parameterization. Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms previous state-of-the-art KD-based baselines. Further analysis indicates that WID can also learn the attention patterns from the teacher model without any alignment loss on attention distributions. The code is available at https://github.com/wutaiqiang/WID-NAACL2024.
comment: 9 pages, 4 figures, NAACL2024 findings
♻ ☆ AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
In our research, we pioneer a novel approach to evaluate the effectiveness of jailbreak attacks on Large Language Models (LLMs), such as GPT-4 and LLaMa2, diverging from traditional robustness-focused binary evaluations. Our study introduces two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework, using a scoring range from 0 to 1, offers a unique perspective, enabling a more comprehensive and nuanced evaluation of attack effectiveness and empowering attackers to refine their attack prompts with greater understanding. Furthermore, we have developed a comprehensive ground truth dataset specifically tailored for jailbreak tasks. This dataset not only serves as a crucial benchmark for our current study but also establishes a foundational resource for future research, enabling consistent and comparative analyses in this evolving field. Upon meticulous comparison with traditional evaluation methods, we discovered that our evaluation aligns with the baseline's trend while offering a more profound and detailed assessment. We believe that by accurately evaluating the effectiveness of attack prompts in the Jailbreak task, our work lays a solid foundation for assessing a wider array of similar or even more complex tasks in the realm of prompt injection, potentially revolutionizing this field.
♻ ☆ Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets
Large language models (LLMs) have received a lot of attention in natural language processing (NLP) research because of their exceptional performance in understanding and generating human languages. However, low-resource languages are left behind due to the unavailability of resources. In this work, we focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets to improve language model performance for Amharic. We compile an Amharic instruction fine-tuning dataset and fine-tuned LLaMA-2-Amharic model. The fine-tuned model shows promising results in different NLP tasks. We open-source our dataset creation pipeline, instruction datasets, trained models, and evaluation outputs to promote language-specific studies on these models.
♻ ☆ Do Language Models Know When They're Hallucinating References?
State-of-the-art language models (LMs) are notoriously susceptible to generating hallucinated information. Such inaccurate outputs not only undermine the reliability of these models but also limit their use and raise serious concerns about misinformation and propaganda. In this work, we focus on hallucinated book and article references and present them as the "model organism" of language model hallucination research, due to their frequent and easy-to-discern nature. We posit that if a language model cites a particular reference in its output, then it should ideally possess sufficient information about its authors and content, among other relevant details. Using this basic insight, we illustrate that one can identify hallucinated references without ever consulting any external resources, by asking a set of direct or indirect queries to the language model about the references. These queries can be considered as "consistency checks." Our findings highlight that while LMs, including GPT-4, often produce inconsistent author lists for hallucinated references, they also often accurately recall the authors of real references. In this sense, the LM can be said to "know" when it is hallucinating references. Furthermore, these findings show how hallucinated references can be dissected to shed light on their nature. Replication code and results can be found at https://github.com/microsoft/hallucinated-references.
♻ ☆ Piecing Together Clues: A Benchmark for Evaluating the Detective Skills of Large Language Models
Detectives frequently engage in information detection and reasoning simultaneously when making decisions across various cases, especially when confronted with a vast amount of information. With the rapid development of large language models~(LLMs), evaluating how these models identify key information and reason to solve questions becomes increasingly relevant. We introduces the DetectBench, a reading comprehension dataset designed to assess a model's ability to jointly ability in key information detection and multi-hop reasoning when facing complex and implicit information. The DetectBench comprises 3,928 questions, each paired with a paragraph averaging 190 tokens in length. To enhance model's detective skills, we propose the Detective Thinking Framework. These methods encourage models to identify all possible clues within the context before reasoning. Our experiments reveal that existing models perform poorly in both information detection and multi-hop reasoning. However, the Detective Thinking Framework approach alleviates this issue.
♻ ☆ EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversation SC
Emotion Recognition in Conversation (ERC) has attracted growing attention in recent years as a result of the advancement and implementation of human-computer interface technologies. In this paper, we propose an emotional inertia and contagion-driven dependency modeling approach (EmotionIC) for ERC task. Our EmotionIC consists of three main components, i.e., Identity Masked Multi-Head Attention (IMMHA), Dialogue-based Gated Recurrent Unit (DiaGRU), and Skip-chain Conditional Random Field (SkipCRF). Compared to previous ERC models, EmotionIC can model a conversation more thoroughly at both the feature-extraction and classification levels. The proposed model attempts to integrate the advantages of attention- and recurrence-based methods at the feature-extraction level. Specifically, IMMHA is applied to capture identity-based global contextual dependencies, while DiaGRU is utilized to extract speaker- and temporal-aware local contextual information. At the classification level, SkipCRF can explicitly mine complex emotional flows from higher-order neighboring utterances in the conversation. Experimental results show that our method can significantly outperform the state-of-the-art models on four benchmark datasets. The ablation studies confirm that our modules can effectively model emotional inertia and contagion.
comment: Accepted by SCIENCE CHINA Information Sciences (SCIS)
♻ ☆ General-Purpose Retrieval-Enhanced Medical Prediction Model Using Near-Infinite History
Developing clinical prediction models (e.g., mortality prediction) based on electronic health records (EHRs) typically relies on expert opinion for feature selection and adjusting observation window size. This burdens experts and creates a bottleneck in the development process. We propose Retrieval-Enhanced Medical prediction model (REMed) to address such challenges. REMed can essentially evaluate an unlimited number of clinical events, select the relevant ones, and make predictions. This approach effectively eliminates the need for manual feature selection and enables an unrestricted observation window. We verified these properties through experiments on 27 clinical tasks and two independent cohorts from publicly available EHR datasets, where REMed outperformed other contemporary architectures that aim to handle as many events as possible. Notably, we found that the preferences of REMed align closely with those of medical experts. We expect our approach to significantly expedite the development of EHR prediction models by minimizing clinicians' need for manual involvement.
comment: The source codes corresponding to this paper are available at: https://github.com/starmpcc/REMed
♻ ☆ The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings
The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment, which are used in a series of experiments focused on training a robust sentiment identifier for parliamentary proceedings. The paper additionally introduces the first domain-specific multilingual transformer language model for political science applications, which was additionally pre-trained on 1.72 billion words from parliamentary proceedings of 27 European parliaments. We present experiments demonstrating how the additional pre-training on parliamentary data can significantly improve the model downstream performance, in our case, sentiment identification in parliamentary proceedings. We further show that our multilingual model performs very well on languages not seen during fine-tuning, and that additional fine-tuning data from other languages significantly improves the target parliament's results. The paper makes an important contribution to multiple disciplines inside the social sciences, and bridges them with computer science and computational linguistics. Lastly, the resulting fine-tuned language model sets up a more robust approach to sentiment analysis of political texts across languages, which allows scholars to study political sentiment from a comparative perspective using standardized tools and techniques.
♻ ☆ The Power of Noise: Toward a Unified Multi-modal Knowledge Graph Representation Framework
The advancement of Multi-modal Pre-training highlights the necessity for a robust Multi-Modal Knowledge Graph (MMKG) representation learning framework. This framework is crucial for integrating structured knowledge into multi-modal Large Language Models (LLMs) at scale, aiming to alleviate issues like knowledge misconceptions and multi-modal hallucinations. In this work, to evaluate models' ability to accurately embed entities within MMKGs, we focus on two widely researched tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking for the robust integration of multi-modal entity features in KGs. By incorporating specific training objectives for both MKGC and MMEA, our approach achieves SOTA performance across a total of ten datasets (three for MKGC and seven for MEMA), demonstrating its robustness and versatility. Besides, SNAG can not only function as a standalone model but also enhance other existing methods, providing stable performance improvements. Our code and data are available at: https://github.com/zjukg/SNAG.
comment: Ongoing work; 10 pages, 6 Tables, 2 Figures; Repo is available at https://github.com/zjukg/SNAG
♻ ☆ In Search of Truth: An Interrogation Approach to Hallucination Detection
Despite the many advances of Large Language Models (LLMs) and their unprecedented rapid evolution, their impact and integration into every facet of our daily lives is limited due to various reasons. One critical factor hindering their widespread adoption is the occurrence of hallucinations, where LLMs invent answers that sound realistic, yet drift away from factual truth. In this paper, we present a novel method for detecting hallucinations in large language models, which tackles a critical issue in the adoption of these models in various real-world scenarios. Through extensive evaluations across multiple datasets and LLMs, including Llama-2, we study the hallucination levels of various recent LLMs and demonstrate the effectiveness of our method to automatically detect them. Notably, we observe up to 62% hallucinations for Llama-2 in a specific experiment, where our method achieves a Balanced Accuracy (B-ACC) of 87%, all without relying on external knowledge.
♻ ☆ SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes SemEval 2024
This paper presents the results of the SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine translation, paraphrase generation and definition modeling. The shared task was tackled by a total of 58 different users grouped in 42 teams, out of which 27 elected to write a system description paper; collectively, they submitted over 300 prediction sets on both tracks of the shared task. We observe a number of key trends in how this approach was tackled -- many participants rely on a handful of model, and often rely either on synthetic data for fine-tuning or zero-shot prompting strategies. While a majority of the teams did outperform our proposed baseline system, the performances of top-scoring systems are still consistent with a random handling of the more challenging items.
comment: SemEval 2024 shared task. Pre-review version
♻ ☆ BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction ICLR2024
As a novel and effective fine-tuning paradigm based on large-scale pre-trained language models (PLMs), prompt-tuning aims to reduce the gap between downstream tasks and pre-training objectives. While prompt-tuning has yielded continuous advancements in various tasks, such an approach still remains a persistent defect: prompt-tuning methods fail to generalize to specific few-shot patterns. From the perspective of distribution analyses, we disclose that the intrinsic issues behind the phenomenon are the over-multitudinous conceptual knowledge contained in PLMs and the abridged knowledge for target downstream domains, which jointly result in that PLMs mis-locate the knowledge distributions corresponding to the target domains in the universal knowledge embedding space. To this end, we intuitively explore to approximate the unabridged target domains of downstream tasks in a debiased manner, and then abstract such domains to generate discriminative prompts, thereby providing the de-ambiguous guidance for PLMs. Guided by such an intuition, we propose a simple yet effective approach, namely BayesPrompt, to learn prompts that contain the domain discriminative information against the interference from domain-irrelevant knowledge. BayesPrompt primitively leverages known distributions to approximate the debiased factual distributions of target domains and further uniformly samples certain representative features from the approximated distributions to generate the ultimate prompts for PLMs. We provide theoretical insights with the connection to domain adaptation. Empirically, our method achieves state-of-the-art performance on benchmarks.
comment: Accepted by ICLR2024
♻ ☆ PaD: Program-aided Distillation Can Teach Small Models Reasoning Better than Chain-of-thought Fine-tuning NAACL 2024
While large language models (LLMs) excel in various natural language processing tasks, their huge size and the inaccessibility of parameters present challenges for practical deployment. Previous studies try to distill task-specific ability from LLMs to smaller models, using data synthesis and chain-of-thought (CoT) fine-tuning. However, synthetic CoT data often contains faulty reasoning, which deteriorates the quality of distillation, especially in reasoning capabilities. In this work, we propose Program-aided Distillation (PaD), which introduces reasoning programs to suppress the errors in distilled data, and thus achieves better distillation quality for reasoning tasks. In PaD, we utilize the reasoning program to substitute the CoT, allowing automated error checking of synthetic data. Further, through error injecting and further training, the small distilling model could iteratively self-refine the reasoning. Moreover, we conduct a step-wise beam search by step-by-step verifying to acquire more exact reasoning chains. We evaluate PaD on arithmetic reasoning, symbolic reasoning, and general ability. Experimental results demonstrate that smaller models using PaD can not only outperform certain LLMs~(e.g., LLaMA-1 13B) but also achieve strong improvement over baselines with a significantly smaller scale of parameters and data. The source code is publicly available at https://github.com/Xuekai-Zhu/pad.
comment: NAACL 2024 Long Paper; Code and data are available at https://github.com/Xuekai-Zhu/pad
♻ ☆ ChEDDAR: Student-ChatGPT Dialogue in EFL Writing Education
The integration of generative AI in education is expanding, yet empirical analyses of large-scale, real-world interactions between students and AI systems still remain limited. In this study, we present ChEDDAR, ChatGPT & EFL Learner's Dialogue Dataset As Revising an essay, which is collected from a semester-long longitudinal experiment involving 212 college students enrolled in English as Foreign Langauge (EFL) writing courses. The students were asked to revise their essays through dialogues with ChatGPT. ChEDDAR includes a conversation log, utterance-level essay edit history, self-rated satisfaction, and students' intent, in addition to session-level pre-and-post surveys documenting their objectives and overall experiences. We analyze students' usage patterns and perceptions regarding generative AI with respect to their intent and satisfaction. As a foundational step, we establish baseline results for two pivotal tasks in task-oriented dialogue systems within educational contexts: intent detection and satisfaction estimation. We finally suggest further research to refine the integration of generative AI into education settings, outlining potential scenarios utilizing ChEDDAR. ChEDDAR is publicly available at https://github.com/zeunie/ChEDDAR.
comment: The new version of this paper is on arXiv as arXiv:2403.08272
♻ ☆ CharPoet: A Chinese Classical Poetry Generation System Based on Token-free LLM
Automatic Chinese classical poetry generation has attracted much research interest, but achieving effective control over format and content simultaneously remains challenging. Traditional systems usually accept keywords as user inputs, resulting in limited control over content. Large language models (LLMs) improve content control by allowing unrestricted user instructions, but the token-by-token generation process frequently makes format errors. Motivated by this, we propose CharPoet, a Chinese classical poetry generation system based on token-free LLM, which provides effective control over both format and content. Our token-free architecture generates in a character-by-character manner, enabling precise control over the number of characters. Pruned from existing token-based LLMs, CharPoet inherits their pretrained capabilities and can generate poetry following instructions like "Write me a poem for my mother's birthday." CharPoet achieves format accuracy above 0.96, outperforming Jiuge-GPT-2 (0.91) and GPT-4 (0.38). In terms of content quality, CharPoet surpasses traditional systems including Jiuge, and is comparable to other LLMs. Our system is open source and available at https://modelscope.cn/models/CharPoet/CharPoet. A video demonstration of CharPoet is available at https://youtu.be/voZ25qEp3Dc.
♻ ☆ Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training
While large language models (LLMs) have achieved impressive performance across diverse tasks, recent studies showcase that causal LLMs suffer from the "reversal curse". It is a typical example that the model knows "A's father is B", but is unable to reason "B's child is A". This limitation poses a challenge to the advancement of artificial general intelligence (AGI), as it suggests a gap in the models' ability to comprehend and apply bidirectional reasoning. In this paper, we first conduct substantial evaluation and identify that the root cause of the reversal curse lies in the different word order between the training and inference stage, namely, the poor ability of causal language models to predict antecedent words within the training data. Accordingly, permutation on the training data is considered as a potential solution, since this can make the model predict antecedent words or tokens. However, previous permutation methods may disrupt complete phrases or entities, thereby posing challenges for the model to comprehend and learn from training data. To address this issue, we propose Semantic-aware Permutation Training (SPT), which addresses this issue by segmenting the training sentences into semantic units (i.e., entities or phrases) with an assistant language model and permuting these units before feeding into the model. Extensive experiments demonstrate that SPT effectively mitigates the reversal curse since the performance on reversed questions approximates that on the forward ones, and significantly advances the performance of existing works.
♻ ☆ Over-Reasoning and Redundant Calculation of Large Language Models EACL 2024
Large language models (LLMs) can solve problems step-by-step. While this chain-of-thought (CoT) reasoning boosts LLMs' performance, it is unclear if LLMs \textit{know} when to use CoT and whether those CoT are always necessary to answer the question. This paper shows that LLMs tend to generate redundant calculations and reasoning on a manually constructed math QA dataset, GSM8K-Zero. GSM8K-Zero is constructed such that the questions can be answered without any calculations, but LLMs, including Llama-2 models and Claude-2, tend to generate lengthy and unnecessary calculations to answer the questions. We also conduct experiments to explain why LLMs generate redundant calculations and reasonings. GSM8K-Zero is publicly available at https://github.com/d223302/Over-Reasoning-of-LLMs and https://huggingface.co/datasets/dcml0714/GSM8K-Zero.
comment: EACL 2024 main conference paper. Camera-ready version
♻ ☆ AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce \textbf{AgentOhana} as a comprehensive solution to address these challenges. \textit{AgentOhana} aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. It meticulously standardizes and unifies these trajectories into a consistent format, streamlining the creation of a generic data loader optimized for agent training. Leveraging the data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training. Additionally, we present \textbf{xLAM-v0.1}, a large action model tailored for AI agents, which demonstrates exceptional performance across various benchmarks. Begin the exploration at \url{https://github.com/SalesforceAIResearch/xLAM}.
comment: Add GitHub repo link at \url{https://github.com/SalesforceAIResearch/xLAM} and HuggingFace model link at \url{https://huggingface.co/Salesforce/xLAM-v0.1-r}
♻ ☆ LLatrieval: LLM-Verified Retrieval for Verifiable Generation NAACL 2024
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents, which enables the user to flexibly verify the answer and makes the LLM's output more reliable. Retrieval plays a crucial role in verifiable generation. Specifically, the retrieved documents not only supplement knowledge to help the LLM generate correct answers, but also serve as supporting evidence for the user to verify the LLM's output. However, the widely used retrievers become the bottleneck of the entire pipeline and limit the overall performance. Their capabilities are usually inferior to LLMs since they often have much fewer parameters than the large language model and have not been demonstrated to scale well to the size of LLMs. If the retriever does not correctly find the supporting documents, the LLM can not generate the correct and verifiable answer, which overshadows the LLM's remarkable abilities. To address these limitations, we propose \LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question. Thus, the LLM can iteratively provide feedback to retrieval and facilitate the retrieval result to fully support verifiable generation. Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
comment: Accepted by NAACL 2024 (Main Conference)
♻ ☆ Bidirectional End-to-End Learning of Retriever-Reader Paradigm for Entity Linking
Entity Linking (EL) is a fundamental task for Information Extraction and Knowledge Graphs. The general form of EL (i.e., end-to-end EL) aims to first find mentions in the given input document and then link the mentions to corresponding entities in a specific knowledge base. Recently, the paradigm of retriever-reader promotes the progress of end-to-end EL, benefiting from the advantages of dense entity retrieval and machine reading comprehension. However, the existing study only trains the retriever and the reader separately in a pipeline manner, which ignores the benefit that the interaction between the retriever and the reader can bring to the task. To advance the retriever-reader paradigm to perform more perfectly on end-to-end EL, we propose BEER$^2$, a Bidirectional End-to-End training framework for Retriever and Reader. Through our designed bidirectional end-to-end training, BEER$^2$ guides the retriever and the reader to learn from each other, make progress together, and ultimately improve EL performance. Extensive experiments on benchmarks of multiple domains demonstrate the effectiveness of our proposed BEER$^2$.
♻ ☆ Exploring semantic information in disease: Simple Data Augmentation Techniques for Chinese Disease Normalization
Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. While data augmentation is a powerful approach for addressing data scarcity, our findings reveal that conventional data augmentation techniques often impede task performance, primarily due to the multi-axis and multi-granularity nature of disease names. Consequently, we introduce a set of customized data augmentation techniques designed to leverage the semantic information inherent in disease names. These techniques aim to enhance the model's understanding of the semantic intricacies and classification structure of disease names. Through extensive experimentation, we illustrate that our proposed plug-and-play methods not only surpass general data augmentation techniques but also exhibit significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data. This underscores its potential for widespread application in medical language processing tasks.
♻ ☆ Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions EACL 2023
In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator's instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.
comment: EACL 2023 (Outstanding Paper Award)
♻ ☆ SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning LREC
Contrastive learning has recently achieved compelling performance in unsupervised sentence representation. As an essential element, data augmentation protocols, however, have not been well explored. The pioneering work SimCSE resorting to a simple dropout mechanism (viewed as continuous augmentation) surprisingly dominates discrete augmentations such as cropping, word deletion, and synonym replacement as reported. To understand the underlying rationales, we revisit existing approaches and attempt to hypothesize the desiderata of reasonable data augmentation methods: balance of semantic consistency and expression diversity. We then develop three simple yet effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation. They act as minimal noises at lexical level to produce diverse forms of sentences. Furthermore, standard negation is capitalized on to generate negative samples for alleviating feature suppression involved in contrastive learning. We experimented extensively with semantic textual similarity on diverse datasets. The results support the superiority of the proposed methods consistently.
comment: Accepted by LREC-COLING 2024
♻ ☆ Can Whisper perform speech-based in-context learning? ICASSP 2024
This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI. A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs) with only a small number of labelled speech samples without gradient descent. Language-level adaptation experiments using Chinese dialects showed that when applying SICL to isolated word ASR, consistent and considerable relative WER reductions can be achieved using Whisper models of any size on two dialects, which is on average 32.3%. A k-nearest-neighbours-based in-context example selection technique can be applied to further improve the efficiency of SICL, which can increase the average relative WER reduction to 36.4%. The findings are verified using speaker adaptation or continuous speech recognition tasks, and both achieved considerable relative WER reductions. Detailed quantitative analyses are also provided to shed light on SICL's adaptability to phonological variances and dialect-specific lexical nuances.
comment: Accepted by ICASSP 2024
♻ ☆ MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations ICLR 2024
Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. Existing benchmark datasets are limited in scale and suffer from difficulties in handling out-of-scope samples that arise in multi-turn conversational interactions. We introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Besides 9,304 in-scope samples, it also includes 5,736 out-of-scope samples appearing in multi-turn contexts, which naturally occur in real-world scenarios. Furthermore, we provide comprehensive information on the speakers in each utterance, enriching its utility for multi-party conversational research. We establish a general framework supporting the organization of single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, as well as in-scope classification and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods incorporating nonverbal information yield improvements, effectively leveraging context information and detecting out-of-scope samples remains a substantial challenge. Notably, large language models exhibit a significant performance gap compared to humans, highlighting the limitations of machine learning methods in the cognitive intent understanding task. We believe that MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research in human-machine conversational interactions, and significantly facilitating related applications. The full dataset and codes are available at https://github.com/thuiar/MIntRec2.0.
comment: Published in ICLR 2024; The abstract is slightly modified due to the length limitation
♻ ☆ Calibrated Language Models Must Hallucinate STOC
Recent language models generate false but plausible-sounding text with surprising frequency. Such "hallucinations" are an obstacle to the usability of language-based AI systems and can harm people who rely upon their outputs. This work shows that there is an inherent statistical lower-bound on the rate that pretrained language models hallucinate certain types of facts, having nothing to do with the transformer LM architecture or data quality. For "arbitrary" facts whose veracity cannot be determined from the training data, we show that hallucinations must occur at a certain rate for language models that satisfy a statistical calibration condition appropriate for generative language models. Specifically, if the maximum probability of any fact is bounded, we show that the probability of generating a hallucination is close to the fraction of facts that occur exactly once in the training data (a "Good-Turing" estimate), even assuming ideal training data without errors. One conclusion is that models pretrained to be sufficiently good predictors (i.e., calibrated) may require post-training to mitigate hallucinations on the type of arbitrary facts that tend to appear once in the training set. However, our analysis also suggests that there is no statistical reason that pretraining will lead to hallucination on facts that tend to appear more than once in the training data (like references to publications such as articles and books, whose hallucinations have been particularly notable and problematic) or on systematic facts (like arithmetic calculations). Therefore, different architectures and learning algorithms may mitigate these latter types of hallucinations.
comment: In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC) 2024
♻ ☆ Unimodal Aggregation for CTC-based Speech Recognition ICASSP 2024
This paper works on non-autoregressive automatic speech recognition. A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token, and thus to learn better feature representations for text tokens. The frame-wise features and weights are both derived from an encoder. Then, the feature frames with unimodal weights are integrated and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training. Compared to the regular CTC, the proposed method learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity. Experiments on three Mandarin datasets show that UMA demonstrates superior or comparable performance to other advanced non-autoregressive methods, such as self-conditioned CTC. Moreover, by integrating self-conditioned CTC into the proposed framework, the performance can be further noticeably improved.
comment: Accepted by ICASSP 2024
♻ ☆ PsyChat: A Client-Centric Dialogue System for Mental Health Support SC
Dialogue systems are increasingly integrated into mental health support to help clients facilitate exploration, gain insight, take action, and ultimately heal themselves. A practical and user-friendly dialogue system should be client-centric, focusing on the client's behaviors. However, existing dialogue systems publicly available for mental health support often concentrate solely on the counselor's strategies rather than the behaviors expressed by clients. This can lead to unreasonable or inappropriate counseling strategies and corresponding responses generated by the dialogue system. To address this issue, we propose PsyChat, a client-centric dialogue system that provides psychological support through online chat. The client-centric dialogue system comprises five modules: client behavior recognition, counselor strategy selection, input packer, response generator, and response selection. Both automatic and human evaluations demonstrate the effectiveness and practicality of our proposed dialogue system for real-life mental health support. Furthermore, the case study demonstrates that the dialogue system can predict the client's behaviors, select appropriate counselor strategies, and generate accurate and suitable responses.
comment: Accepted to CSCWD 2024 (27th International Conference on Computer Supported Cooperative Work in Design)
♻ ☆ Generative Multimodal Entity Linking LREC
Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to the referent entities from a knowledge base. Existing MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters, which can be prohibitively costly and difficult to scale in the era of Large Language Models (LLMs). In this work, we propose GEMEL, a Generative Multimodal Entity Linking framework based on LLMs, which directly generates target entity names. We keep the vision and language model frozen and only train a feature mapper to enable cross-modality interactions. To adapt LLMs to the MEL task, we leverage the in-context learning capability of LLMs by retrieving multimodal instances as demonstrations. Extensive experiments show that, with only ~0.3% of the model parameters fine-tuned, GEMEL achieves state-of-the-art results on two well-established MEL datasets (7.7% accuracy gains on WikiDiverse and 8.8% accuracy gains on WikiMEL). The performance gain stems from mitigating the popularity bias of LLM predictions and disambiguating less common entities effectively. Further analysis verifies the generality and scalability of GEMEL. Our framework is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution for utilizing LLMs in the MEL task. Our code is available at https://github.com/HITsz-TMG/GEMEL.
comment: Accepted by LREC-COLING 2024
♻ ☆ Advancing Beyond Identification: Multi-bit Watermark for Large Language Models NAACL 2024
We show the viability of tackling misuses of large language models beyond the identification of machine-generated text. While existing zero-bit watermark methods focus on detection only, some malicious misuses demand tracing the adversary user for counteracting them. To address this, we propose Multi-bit Watermark via Position Allocation, embedding traceable multi-bit information during language model generation. Through allocating tokens onto different parts of the messages, we embed longer messages in high corruption settings without added latency. By independently embedding sub-units of messages, the proposed method outperforms the existing works in terms of robustness and latency. Leveraging the benefits of zero-bit watermarking, our method enables robust extraction of the watermark without any model access, embedding and extraction of long messages ($\geq$ 32-bit) without finetuning, and maintaining text quality, while allowing zero-bit detection all at the same time. Code is released here: https://github.com/bangawayoo/mb-lm-watermarking
comment: NAACL 2024 main. 9 pages and appendix
♻ ☆ Prompt Highlighter: Interactive Control for Multi-Modal LLMs CVPR 2024
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs, designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. Motivated by the classifier-free diffusion guidance, we form regular and unconditional context pairs based on highlighted tokens, demonstrating that the autoregressive generation in models can be guided in a classifier-free way. Notably, we find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs. Our approach is compatible with current LLMs and VLMs, achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning on LLaVA-v1.5, our method secured 70.7 in the MMBench test and 1552.5 in MME-perception. The code is available at: https://github.com/dvlab-research/Prompt-Highlighter/
comment: CVPR 2024; Project Page: https://julianjuaner.github.io/projects/PromptHighlighter
♻ ☆ Training Small Multimodal Models to Bridge Biomedical Competency Gap: A Case Study in Radiology Imaging
The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such large models in biomedicine. However, despite early promising results on some biomedical benchmarks, there are still major challenges that need to be addressed before these models can be used in real-world applications. Frontier models such as GPT-4V still have major competency gaps in multimodal capabilities for biomedical applications. Moreover, pragmatic issues such as access, cost, latency, and compliance make it hard for clinicians to use privately-hosted state-of-the-art large models directly on private patient data. In this paper, we explore training open-source small multimodal models (SMMs) to bridge biomedical competency gaps for unmet clinical needs. To maximize data efficiency, we adopt a modular approach by incorporating state-of-the-art pre-trained models for image and text modalities, and focusing on training a lightweight adapter to ground each modality to the text embedding space. We conduct a comprehensive study of this approach on radiology imaging. For training, we assemble a large dataset with over 1 million image-text pairs. For evaluation, we propose a clinically driven novel approach using GPT-4 and demonstrate its parity with expert evaluation. We also study grounding qualitatively using attention. For best practice, we conduct a systematic ablation study on various choices in data engineering and multimodal training. The resulting LLaVA-Rad (7B) model attains state-of-the-art results on radiology tasks such as report generation and cross-modal retrieval, even outperforming much larger models such as GPT-4V and Med-PaLM M (84B). LLaVA-Rad is fast and can be run on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
♻ ☆ AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models ICLR 2024
The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.
comment: Published as a conference paper at ICLR 2024. Code is available at https://github.com/SheltonLiu-N/AutoDAN
♻ ☆ Moral Judgments in Narratives on Reddit: Investigating Moral Sparks via Social Commonsense and Linguistic Signals
Machine ethics ensures ethical conduct in Artificial Intelligence (AI) models and agents. Examining real-life applications benefit learning practical ethics in many situations, offering valuable data to grasp the complexities of human ethics in diverse contexts. In this paper, we examine social media platforms for understanding real-life ethical scenarios and human moral judgments. We examine posts from a popular Reddit subreddit (i.e., a subcommunity) called r/AmITheAsshole, where authors and commenters share their moral judgments on who is blameworthy. We employ computational techniques to investigate the underlying reasoning influencing moral judgments. We focus on excerpts-which we term moral sparks-from original posts that commenters include to indicate what motivates their judgments. To this end, we examine how (1) events activating social commonsense and (2) linguistic signals affect moral sparks assignment and their subsequent judgments. By examining over 24 672 posts and 175988 comments, we find that event-related negative character traits (e.g., immature and rude) attract attention and stimulate blame, implying a dependent relationship between character traits and moral values. Specially, we focus on causal graph involving events (c-events) that activate social commonsense. We observe that c-events are perceived with varying levels of informativeness, influencing moral spark and judgment assignment in distinct ways. This observation is reinforced by examining linguistic features describing semantically similar c-events. Moreover, language influencing commenters' cognitive processes enhances the probability of an excerpt becoming a moral spark, while factual and concrete descriptions tend to inhibit this effect.
♻ ☆ RoDia: A New Dataset for Romanian Dialect Identification from Speech NAACL 2024
We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
comment: Accepted at NAACL 2024
♻ ☆ Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs
Recent advances in large language models (LLM) have enabled richer social simulations, allowing for the study of various social phenomena with LLM-based agents. However, most work has used an omniscient perspective on these simulations (e.g., single LLM to generate all interlocutors), which is fundamentally at odds with the non-omniscient, information asymmetric interactions that humans have. To examine these differences, we develop an evaluation framework to simulate social interactions with LLMs in various settings (omniscient, non-omniscient). Our experiments show that interlocutors simulated omnisciently are much more successful at accomplishing social goals compared to non-omniscient agents, despite the latter being the more realistic setting. Furthermore, we demonstrate that learning from omniscient simulations improves the apparent naturalness of interactions but scarcely enhances goal achievement in cooperative scenarios. Our findings indicate that addressing information asymmetry remains a fundamental challenge for LLM-based agents.
♻ ☆ Metacognitive Prompting Improves Understanding in Large Language Models NAACL 2024
In Large Language Models (LLMs), there have been consistent advancements in task-specific performance, largely influenced by effective prompt design. Recent advancements in prompting have enhanced reasoning in logic-intensive tasks for LLMs, yet the nuanced understanding abilities of these models, crucial for processing and interpreting complex information, remain underexplored. In this study, we introduce Metacognitive Prompting (MP), a strategy inspired by human introspective reasoning processes. Using MP, LLMs undergo a systematic series of structured, self-aware evaluations, drawing on both their vast inherent knowledge and new insights. We conduct extensive experiments on four prevalent LLMs: Llama2, PaLM2, GPT-3.5, and GPT-4, across ten natural language understanding (NLU) datasets from GLUE, SuperGLUE, BLUE, and LexGLUE benchmarks. Additionally, we compare our method with chain-of-thought prompting and its advanced versions. The results show that GPT-4 consistently excels across all tasks, while other models have shown significant progress in some tasks when used in conjunction with MP. Furthermore, MP consistently outperforms existing prompting methods in both general and domain-specific NLU tasks. This study underscores the potential to amplify the understanding abilities of LLMs and highlights the benefits of mirroring human introspective reasoning in NLU tasks.
comment: NAACL 2024
♻ ☆ The LLM Surgeon
State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to large language models. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of large language models.
♻ ☆ Assessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification
The reasoning capabilities of LLMs are currently hotly debated. We examine the issue from the perspective of claim/rumour verification. We propose the first logical reasoning framework designed to break down any claim or rumour paired with evidence into the atomic reasoning steps necessary for verification. Based on our framework, we curate two annotated collections of such claim/evidence pairs: a synthetic dataset from Wikipedia and a real-world set stemming from rumours circulating on Twitter. We use them to evaluate the reasoning capabilities of GPT-3.5-Turbo and GPT-4 (hereinafter referred to as ChatGPT) within the context of our framework, providing a thorough analysis. Our results show that ChatGPT struggles in abductive reasoning, although this can be somewhat mitigated by using manual Chain of Thought (CoT) as opposed to Zero-Shot (ZS) and ZS CoT approaches. Our study contributes to the growing body of research suggesting that ChatGPT's reasoning processes are unlikely to mirror human-like reasoning, and that LLMs need to be more rigorously evaluated to distinguish between hype and actual capabilities, especially in high-stakes real-world tasks such as claim verification.
comment: 19 pages, 1 figure
♻ ☆ Multi-Scale Contrastive Knowledge Co-Distillation for Event Temporal Relation Extraction
Event Temporal Relation Extraction (ETRE) is a crucial yet challenging problem. Event pairs are situated within a discourse at different distances, which we refer to as proximity bands. The temporal ordering communicated about event pairs situated at more remote (i.e., ``long'') or less remote (i.e., ``short'') proximity bands is encoded differently. SOTA ETRE models have tended to perform well on events situated at either short or long proximity bands, but not both. Yet, real-world, natural texts contain all types of temporal event-pairs. In this paper, we present MulCo: Multi-Scale Contrastive Knowledge Co-Distillation, a fusion approach that shares knowledge across multiple event pair proximity bands in order to improve performance on all types of temporal datasets. Our experimental results show that MulCo successfully integrates linguistic cues pertaining to temporal reasoning across both short and long proximity bands and achieves new state-of-the-art results on several ETRE benchmark datasets.
comment: update
♻ ☆ Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment
Dysarthria is a disability that causes a disturbance in the human speech system and reduces the quality and intelligibility of a person's speech. Because of this effect, the normal speech processing systems can not work properly on impaired speech. This disability is usually associated with physical disabilities. Therefore, designing a system that can perform some tasks by receiving voice commands in the smart home can be a significant achievement. In this work, we introduce gammatonegram as an effective method to represent audio files with discriminative details, which is used as input for the convolutional neural network. On the other word, we convert each speech file into an image and propose image recognition system to classify speech in different scenarios. Proposed CNN is based on the transfer learning method on the pre-trained Alexnet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is located in a cascade arrangement with the two-class intelligibility assessment system, and the output of this system activates each one of the speech recognition networks. This architecture achieves an accuracy of 92.3% WRR. The source code of this paper is available.
comment: 12 pages, 8 figures. Iran J Comput Sci (2024)
♻ ☆ OSCaR: Object State Captioning and State Change Representation NAACL 2024
The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.
comment: NAACL 2024
Software Engineering
☆ The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI
Generative AI (GAI) offers unprecedented possibilities but its commercialization has raised concerns about transparency, reproducibility, bias, and safety. Many "open-source" GAI models lack the necessary components for full understanding and reproduction, and some use restrictive licenses, a practice known as "openwashing." We propose the Model Openness Framework (MOF), a ranked classification system that rates machine learning models based on their completeness and openness, following principles of open science, open source, open data, and open access. The MOF requires specific components of the model development lifecycle to be included and released under appropriate open licenses. This framework aims to prevent misrepresentation of models claiming to be open, guide researchers and developers in providing all model components under permissive licenses, and help companies, academia, and hobbyists identify models that can be safely adopted without restrictions. Wide adoption of the MOF will foster a more open AI ecosystem, accelerating research, innovation, and adoption.
comment: 45 pages
☆ Reinforcement Learning for Online Testing of Autonomous Driving Systems: a Replication and Extension Study
In a recent study, Reinforcement Learning (RL) used in combination with many-objective search, has been shown to outperform alternative techniques (random search and many-objective search) for online testing of Deep Neural Network-enabled systems. The empirical evaluation of these techniques was conducted on a state-of-the-art Autonomous Driving System (ADS). This work is a replication and extension of that empirical study. Our replication shows that RL does not outperform pure random test generation in a comparison conducted under the same settings of the original study, but with no confounding factor coming from the way collisions are measured. Our extension aims at eliminating some of the possible reasons for the poor performance of RL observed in our replication: (1) the presence of reward components providing contrasting or useless feedback to the RL agent; (2) the usage of an RL algorithm (Q-learning) which requires discretization of an intrinsically continuous state space. Results show that our new RL agent is able to converge to an effective policy that outperforms random testing. Results also highlight other possible improvements, which open to further investigations on how to best leverage RL for online ADS testing.
☆ HyLiMo: A Hybrid Live-Synchronized Modular Diagramming Editor as IDE Extension for Technical and Scientific Publications ICSE'24
Creating suitable diagrams for technical and scientific publications is challenging and time-consuming, as manual control over the layout is required to communicate information effectively. Existing diagramming tools usually allow modeling the diagrams via a textual domain-specific language (DSL) that can be rendered and auto-layouted or via a graphical editor. While auto-layout is fast, the results are often not satisfying for most publications. However, graphical editors are time-consuming to create large diagrams. The blended or hybrid modeling concept enables creating diagrams efficiently using a DSL and editing the rendered diagram via the graphical editor for fine-tuning. However, hybrid modeling editors are limited to individual diagram types and do not save the layout and style information in the textual description. Therefore, we propose HyLiMo, a hybrid live-synchronized modular diagramming editor. In HyLiMo, diagrams are created using an internal DSL and live synchronized with an interactive graphical editor for the rendered diagram, allowing a straightforward layout and style change, which is stored in the DSL code. HyLiMo is independent of specific diagram types, but we offer specific functionality for UML class diagrams. Using the language server protocol, we implement it as a web app and IDE extension. The results of our user study indicate that such an approach enables fast and precise diagramming.
comment: 6 pages, 5 figures, ICSE'24 IDE Workshop
☆ MotorEase: Automated Detection of Motor Impairment Accessibility Issues in Mobile App UIs ICSE 2024
Recent research has begun to examine the potential of automatically finding and fixing accessibility issues that manifest in software. However, while recent work makes important progress, it has generally been skewed toward identifying issues that affect users with certain disabilities, such as those with visual or hearing impairments. However, there are other groups of users with different types of disabilities that also need software tooling support to improve their experience. As such, this paper aims to automatically identify accessibility issues that affect users with motor-impairments. To move toward this goal, this paper introduces a novel approach, called MotorEase, capable of identifying accessibility issues in mobile app UIs that impact motor-impaired users. Motor-impaired users often have limited ability to interact with touch-based devices, and instead may make use of a switch or other assistive mechanism -- hence UIs must be designed to support both limited touch gestures and the use of assistive devices. MotorEase adapts computer vision and text processing techniques to enable a semantic understanding of app UI screens, enabling the detection of violations related to four popular, previously unexplored UI design guidelines that support motor-impaired users, including: (i) visual touch target size, (ii) expanding sections, (iii) persisting elements, and (iv) adjacent icon visual distance. We evaluate MotorEase on a newly derived benchmark, called MotorCheck, that contains 555 manually annotated examples of violations to the above accessibility guidelines, across 1599 screens collected from 70 applications via a mobile app testing tool. Our experiments illustrate that MotorEase is able to identify violations with an average accuracy of ~90%, and a false positive rate of less than 9%, outperforming baseline techniques.
comment: Accepted to ICSE 2024 Research Track, 13 pages
☆ Genetic Auto-prompt Learning for Pre-trained Code Intelligence Language Models
As Pre-trained Language Models (PLMs), a popular approach for code intelligence, continue to grow in size, the computational cost of their usage has become prohibitively expensive. Prompt learning, a recent development in the field of natural language processing, emerges as a potential solution to address this challenge. In this paper, we investigate the effectiveness of prompt learning in code intelligence tasks. We unveil its reliance on manually designed prompts, which often require significant human effort and expertise. Moreover, we discover existing automatic prompt design methods are very limited to code intelligence tasks due to factors including gradient dependence, high computational demands, and limited applicability. To effectively address both issues, we propose Genetic Auto Prompt (GenAP), which utilizes an elaborate genetic algorithm to automatically design prompts. With GenAP, non-experts can effortlessly generate superior prompts compared to meticulously manual-designed ones. GenAP operates without the need for gradients or additional computational costs, rendering it gradient-free and cost-effective. Moreover, GenAP supports both understanding and generation types of code intelligence tasks, exhibiting great applicability. We conduct GenAP on three popular code intelligence PLMs with three canonical code intelligence tasks including defect prediction, code summarization, and code translation. The results suggest that GenAP can effectively automate the process of designing prompts. Specifically, GenAP outperforms all other methods across all three tasks (e.g., improving accuracy by an average of 2.13% for defect prediction). To the best of our knowledge, GenAP is the first work to automatically design prompts for code intelligence PLMs.
☆ CONLINE: Complex Code Generation and Refinement with Online Searching and Correctness Testing
Large Language Models (LLMs) have revolutionized code generation ability by converting natural language descriptions into executable code. However, generating complex code within real-world scenarios remains challenging due to intricate structures, subtle bugs, understanding of advanced data types, and lack of supplementary contents. To address these challenges, we introduce the CONLINE framework, which enhances code generation by incorporating planned online searches for information retrieval and automated correctness testing for iterative refinement. CONLINE also serializes the complex inputs and outputs to improve comprehension and generate test case to ensure the framework's adaptability for real-world applications. CONLINE is validated through rigorous experiments on the DS-1000 and ClassEval datasets. It shows that CONLINE substantially improves the quality of complex code generation, highlighting its potential to enhance the practicality and reliability of LLMs in generating intricate code.
☆ Specification Mining for Smart Contracts with Trace Slicing and Predicate Abstraction
Smart contracts are computer programs running on blockchains to implement Decentralized Applications.The absence of contract specifications hinders routine tasks, such as contract understanding and testing. Inthis work, we propose a specification mining approach to infer contract specifications from past transactionhistories. Our approach derives high-level behavioral automata of function invocations, accompanied byprogram invariants statistically inferred from the transaction histories. We implemented our approach as toolSmConand evaluated it on eleven well-studied Azure benchmark smart contracts and six popular real-worldDApp smart contracts. The experiments show thatSmConmines reasonably accurate specifications that canbe used to facilitate DApp understanding and development in terms of document maintenance and test suite improvement.
☆ Enhancing Code Generation Performance of Smaller Models by Distilling the Reasoning Ability of LLMs LREC
Large Language Models (LLMs) have recently made significant advances in code generation through the 'Chain-of-Thought' prompting technique. This technique empowers the model to autonomously devise "solution plans" to tackle intricate programming challenges, thereby improving its performance in code generation. Nevertheless, smaller models have been struggling to keep up with LLMs in deducing these plans, adversely affecting their code generation capabilities. Given the considerable size and associated deployment costs, along with concerns about data security, many teams opt for deploying smaller models for code generation. Consequently, there arises a compelling need for transferring LLMs' code generation reasoning abilities to the smaller models. In this paper, we propose the CodePLAN framework, which aims to transfer LLMs' reasoning capabilities to smaller models through distillation. We adopt a multi-task learning approach, jointly undertaking code generation and solution plan generation tasks, to enhance the code generation capabilities of the smaller model. To ensure the superior quality of the solution plans, we advocate for the utilization of backward reasoning and plan sampling strategies. Our experiments show that in comparison to the conventional fine-tuning approach, our approach improves the smaller model's code generation performance (measured in pass@1 metric) by over 130% on the challenging APPS benchmark.
comment: Accepted for LREC-COLING 2024
☆ Creative and Correct: Requesting Diverse Code Solutions from AI Foundation Models
AI foundation models have the capability to produce a wide array of responses to a single prompt, a feature that is highly beneficial in software engineering to generate diverse code solutions. However, this advantage introduces a significant trade-off between diversity and correctness. In software engineering tasks, diversity is key to exploring design spaces and fostering creativity, but the practical value of these solutions is heavily dependent on their correctness. Our study systematically investigates this trade-off using experiments with HumanEval tasks, exploring various parameter settings and prompting strategies. We assess the diversity of code solutions using similarity metrics from the code clone community. The study identifies combinations of parameters and strategies that strike an optimal balance between diversity and correctness, situated on the Pareto front of this trade-off space. These findings offer valuable insights for software engineers on how to effectively use AI foundation models to generate code solutions that are diverse and accurate.
comment: 4 pages,Forge 2024
☆ Elevating Software Quality in Agile Environments: The Role of Testing Professionals in Unit Testing
Testing is an essential quality activity in the software development process. Usually, a software system is tested on several levels, starting with unit testing that checks the smallest parts of the code until acceptance testing, which is focused on the validations with the end-user. Historically, unit testing has been the domain of developers, who are responsible for ensuring the accuracy of their code. However, in agile environments, testing professionals play an integral role in various quality improvement initiatives throughout each development cycle. This paper explores the participation of test engineers in unit testing within an industrial context, employing a survey-based research methodology. Our findings demonstrate that testing professionals have the potential to strengthen unit testing by collaborating with developers to craft thorough test cases and fostering a culture of mutual learning and cooperation, ultimately contributing to increasing the overall quality of software projects.
☆ Pricing-driven Development and Operation of SaaS : Challenges and Opportunities
As the Software as a Service (SaaS) paradigm continues to reshape the software industry, a nuanced understanding of its operational dynamics becomes increasingly crucial. This paper delves into the intricate relationship between pricing strategies and software development within the SaaS model. Using PetClinic as a case study, we explore the implications of a Pricing-driven Development and Operation approach of SaaS systems, highlighting the delicate balance between business-driven decision-making and technical implementation challenges, shedding light on how pricing plans can shape software features and deployment. Our discussion aims to provide strategic insights for the community to navigate the complexities of this integrated approach, fostering a better alignment between business models and technological capabilities for effective cloud-based services.
comment: JCIS, 10 pages, 5 figures
☆ Pricing4SaaS: a suite of software libraries for pricing-driven feature toggling
As the digital marketplace evolves, the ability to dynamically adjust or disable features and services in response to market demands and pricing strategies becomes increasingly crucial for maintaining competitive advantage and enhancing user engagement. This paper introduces a novel suite of software libraries named Pricing4SaaS, designed to facilitate the implementation of pricing-driven feature toggles in both the front-end and back-end of SaaS systems, and discuss its architectural design principles. Including Pricing4React for front-end and Pricing4Java for back-end, the suite enables developers a streamlined and efficient approach to integrating feature toggles that can be controlled based on pricing plans, emphasizing centralized toggle management, and secure synchronization of the toggling state between the client and server. We also present a case study based on the popular Spring PetClinic project to illustrate how the suite can be leveraged to optimize developer productivity, avoiding technical debt, and improving operational efficiency.
comment: JCIS, 5 pages, 2 figures
☆ FastFlip: Compositional Error Injection Analysis
Instruction-level error injection analyses aim to find instructions where errors often lead to unacceptable outcomes like Silent Data Corruptions (SDCs). These analyses require significant time, which is especially problematic if developers wish to regularly analyze software that evolves over time. We present FastFlip, a combination of empirical error injection and symbolic SDC propagation analyses that enables fast, compositional error injection analysis of evolving programs. FastFlip calculates how SDCs propagate across program sections and correctly accounts for unexpected side effects that can occur due to errors. Using FastFlip, we analyze five benchmarks, plus two modified versions of each benchmark. FastFlip speeds up the analysis of incrementally modified programs by $3.2\times$ (geomean). FastFlip selects a set of instructions to protect against SDCs that minimizes the runtime cost of protection while protecting against a developer-specified target fraction of all SDC-causing errors.
☆ Automated Extraction and Maturity Analysis of Open Source Clinical Informatics Repositories from Scientific Literature
In the evolving landscape of clinical informatics, the integration and utilization of software tools developed through governmental funding represent a pivotal advancement in research and application. However, the dispersion of these tools across various repositories, with no centralized knowledge base, poses significant challenges to leveraging their full potential. This study introduces an automated methodology to bridge this gap by systematically extracting GitHub repository URLs from academic papers indexed in arXiv, focusing on the field of clinical informatics. Our approach encompasses querying the arXiv API for relevant papers, cleaning extracted GitHub URLs, fetching comprehensive repository information via the GitHub API, and analyzing repository maturity based on defined metrics such as stars, forks, open issues, and contributors. The process is designed to be robust, incorporating error handling and rate limiting to ensure compliance with API constraints. Preliminary findings demonstrate the efficacy of this methodology in compiling a centralized knowledge base of NIH-funded software tools, laying the groundwork for an enriched understanding and utilization of these resources within the clinical informatics community. We propose the future integration of Large Language Models (LLMs) to generate concise summaries and evaluations of the tools. This approach facilitates the discovery and assessment of clinical informatics tools and also enables ongoing monitoring of new and actively updated repositories, revolutionizing how researchers access and leverage federally funded software. The implications of this study extend beyond simplification of access to valuable resources; it proposes a scalable model for the dynamic aggregation and evaluation of scientific software, encouraging more collaborative, transparent, and efficient research practices in clinical informatics and beyond.
♻ ☆ An Empirical Study of Fault Localization in Python Programs
Despite its massive popularity as a programming language, especially in novel domains like data science programs, there is comparatively little research about fault localization that targets Python. Even though it is plausible that several findings about programming languages like C/C++ and Java -- the most common choices for fault localization research -- carry over to other languages, whether the dynamic nature of Python and how the language is used in practice affect the capabilities of classic fault localization approaches remain open questions to investigate. This paper is the first multi-family large-scale empirical study of fault localization on real-world Python programs and faults. Using Zou et al.'s recent large-scale empirical study of fault localization in Java as the basis of our study, we investigated the effectiveness (i.e., localization accuracy), efficiency (i.e., runtime performance), and other features (e.g., different entity granularities) of seven well-known fault-localization techniques in four families (spectrum-based, mutation-based, predicate switching, and stack-trace based) on 135 faults from 13 open-source Python projects from the BugsInPy curated collection. The results replicate for Python several results known about Java, and shed light on whether Python's peculiarities affect the capabilities of fault localization. The replication package that accompanies this paper includes detailed data about our experiments, as well as the tool FauxPy that we implemented to conduct the study.
comment: Final version
♻ ☆ An Exploratory Study on Automatic Identification of Assumptions in the Development of Deep Learning Frameworks
Stakeholders constantly make assumptions in the development of deep learning (DL) frameworks. These assumptions are related to various types of software artifacts (e.g., requirements, design decisions, and technical debt) and can turn out to be invalid, leading to system failures. Existing approaches and tools for assumption management usually depend on manual identification of assumptions. However, assumptions are scattered in various sources (e.g., code comments, commits, pull requests, and issues) of DL framework development, and manually identifying assumptions has high costs (e.g., time and resources). To overcome the issues of manually identifying assumptions in DL framework development, we constructed a new and largest dataset (i.e., AssuEval) of assumptions collected from the TensorFlow and Keras repositories on GitHub; explored the performance of seven traditional machine learning models (e.g., Support Vector Machine, Classification and Regression Trees), a popular DL model (i.e., ALBERT), and a large language model (i.e., ChatGPT) of identifying assumptions on the AssuEval dataset. The experiment results show that: ALBERT achieves the best performance (f1-score: 0.9584) of identifying assumptions on the AssuEval dataset, which is much better than the other models (the 2nd best f1-score is 0.6211, achieved by ChatGPT). Though ChatGPT is the most popular large language model, we do not recommend using it to identify assumptions in DL framework development because of its low performance on the task. Fine-tuning ChatGPT specifically for assumption identification could improve the performance. This study provides researchers with the largest dataset of assumptions for further research (e.g., assumption classification, evaluation, and reasoning) and helps practitioners better understand assumptions and how to manage them in their projects.
comment: 28 pages, 15 images, 10 tables, Manuscript submitted to a Journal (2024)
♻ ☆ Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions
A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
♻ ☆ Empirical Analysis on CI/CD Pipeline Evolution in Machine Learning Projects
The growing popularity of machine learning (ML) and the integration of ML components with other software artifacts has led to the use of continuous integration and delivery (CI/CD) tools, such as Travis CI, GitHub Actions, etc. that enable faster integration and testing for ML projects. Such CI/CD configurations and services require synchronization during the life cycle of the projects. Several works discussed how CI/CD configuration and services change during their usage in traditional software systems. However, there is very limited knowledge of how CI/CD configuration and services change in ML projects. To fill this knowledge gap, this work presents the first empirical analysis of how CI/CD configuration evolves for ML software systems. We manually analyzed 343 commits collected from 508 open-source ML projects to identify common CI/CD configuration change categories in ML projects and devised a taxonomy of 14 co-changes in CI/CD and ML components. Moreover, we developed a CI/CD configuration change clustering tool that identified frequent CI/CD configuration change patterns in 15,634 commits. Furthermore, we measured the expertise of ML developers who modify CI/CD configurations. Based on this analysis, we found that 61.8% of commits include a change to the build policy and minimal changes related to performance and maintainability compared to general open-source projects. Additionally, the co-evolution analysis identified that CI/CD configurations, in many cases, changed unnecessarily due to bad practices such as the direct inclusion of dependencies and a lack of usage of standardized testing frameworks. More practices were found through the change patterns analysis consisting of using deprecated settings and reliance on a generic build language. Finally, our developer's expertise analysis suggests that experienced developers are more inclined to modify CI/CD configurations.
Programming Languages
☆ Enhancing Programming Education with ChatGPT: A Case Study on Student Perceptions and Interactions in a Python Course
The integration of ChatGPT as a supportive tool in education, notably in programming courses, addresses the unique challenges of programming education by providing assistance with debugging, code generation, and explanations. Despite existing research validating ChatGPT's effectiveness, its application in university-level programming education and a detailed understanding of student interactions and perspectives remain limited. This paper explores ChatGPT's impact on learning in a Python programming course tailored for first-year students over eight weeks. By analyzing responses from surveys, open-ended questions, and student-ChatGPT dialog data, we aim to provide a comprehensive view of ChatGPT's utility and identify both its advantages and limitations as perceived by students. Our study uncovers a generally positive reception toward ChatGPT and offers insights into its role in enhancing the programming education experience. These findings contribute to the broader discourse on AI's potential in education, suggesting paths for future research and application.
♻ ☆ Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions
A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
Symbolic Computation
☆ Hierarchical NeuroSymbolic Approach for Action Quality Assessment
Action quality assessment (AQA) applies computer vision to quantitatively assess the performance or execution of a human action. Current AQA approaches are end-to-end neural models, which lack transparency and tend to be biased because they are trained on subjective human judgements as ground-truth. To address these issues, we introduce a neuro-symbolic paradigm for AQA, which uses neural networks to abstract interpretable symbols from video data and makes quality assessments by applying rules to those symbols. We take diving as the case study. We found that domain experts prefer our system and find it more informative than purely neural approaches to AQA in diving. Our system also achieves state-of-the-art action recognition and temporal segmentation, and automatically generates a detailed report that breaks the dive down into its elements and provides objective scoring with visual evidence. As verified by a group of domain experts, this report may be used to assist judges in scoring, help train judges, and provide feedback to divers. We will open-source all of our annotated training data and code for ease of reproducibility.
☆ OSVAuto: semi-automatic verifier for functional specifications of operating systems
We present the design and implementation of a tool for semi-automatic verification of functional specifications of operating system modules. Such verification tasks are traditionally done in interactive theorem provers, where the functionalities of the module are specified at abstract and concrete levels using data such as structures, algebraic datatypes, arrays, maps and so on. In this work, we provide encodings to SMT for these commonly occurring data types. This allows verification conditions to be reduced into a form suitable for SMT solvers. The use of SMT solvers combined with a tactic language allows semi-automatic verification of the specification. We apply the tool to verify functional specification for key parts of the uC-OS/II operating system, based on earlier work giving full verification of the system in Coq. We demonstrate a large reduction in the amount of human effort due to increased level of automation.
☆ Gr{ö}bner bases over polytopal affinoid algebras
Polyhedral affinoid algebras have been introduced by Einsiedler, Kapranov and Lind to connect rigid analytic geometry (analytic geometry over non-archimedean fields) and tropical geometry.In this article, we present a theory of Gr{\"o}bner bases for polytopal affinoid algebras that extends both Caruso et al.'s theory of Gr{\"o}bner bases on Tate algebras and Pauer et al.'s theory of Gr{\"o}bner bases on Laurent polynomials.We provide effective algorithms to compute Gr{\"o}bner bases for both ideals of Laurent polynomials and ideals in polytopal affinoid algebras. Experiments with a Sagemath implementation are provided.
☆ Self-Attention Based Semantic Decomposition in Vector Symbolic Architectures
Vector Symbolic Architectures (VSAs) have emerged as a novel framework for enabling interpretable machine learning algorithms equipped with the ability to reason and explain their decision processes. The basic idea is to represent discrete information through high dimensional random vectors. Complex data structures can be built up with operations over vectors such as the "binding" operation involving element-wise vector multiplication, which associates data together. The reverse task of decomposing the associated elements is a combinatorially hard task, with an exponentially large search space. The main algorithm for performing this search is the resonator network, inspired by Hopfield network-based memory search operations. In this work, we introduce a new variant of the resonator network, based on self-attention based update rules in the iterative search problem. This update rule, based on the Hopfield network with log-sum-exp energy function and norm-bounded states, is shown to substantially improve the performance and rate of convergence. As a result, our algorithm enables a larger capacity for associative memory, enabling applications in many tasks like perception based pattern recognition, scene decomposition, and object reasoning. We substantiate our algorithm with a thorough evaluation and comparisons to baselines.
Computer Vision and Pattern Recognition
☆ On Pretraining Data Diversity for Self-Supervised Learning
We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days. Code and trained models will be available at https://github.com/hammoudhasan/DiversitySSL .
comment: Under review
☆ Editing Massive Concepts in Text-to-Image Diffusion Models
Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization for each individual concept with dual self-distillation from text alignment loss and diffusion noise prediction loss. The second stage conducts massive concept editing with multi-layer, closed form model editing. We further propose a comprehensive benchmark, named ImageNet Concept Editing Benchmark (ICEB), for evaluating massive concept editing for T2I models with two subtasks, free-form prompts, massive concept categories, and extensive evaluation metrics. Extensive experiments conducted on our proposed benchmark and previous benchmarks demonstrate the superior scalability of EMCID for editing up to 1,000 concepts, providing a practical approach for fast adjustment and re-deployment of T2I diffusion models in real-world applications.
comment: Project page: https://silentview.github.io/EMCID/ . Code: https://github.com/SilentView/EMCID
☆ RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noise image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines with an increase in category numbers, primarily due to growing complexity and constraints of limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and the 2 object detection datasets under the zero-shot recognition setting.
comment: Project: https://github.com/Liuziyu77/RAR
☆ RadSplat: Radiance Field-Informed Gaussian Splatting for Robust Real-Time Rendering with 900+ FPS
Recent advances in view synthesis and real-time rendering have achieved photorealistic quality at impressive rendering speeds. While Radiance Field-based methods achieve state-of-the-art quality in challenging scenarios such as in-the-wild captures and large-scale scenes, they often suffer from excessively high compute requirements linked to volumetric rendering. Gaussian Splatting-based methods, on the other hand, rely on rasterization and naturally achieve real-time rendering but suffer from brittle optimization heuristics that underperform on more challenging scenes. In this work, we present RadSplat, a lightweight method for robust real-time rendering of complex scenes. Our main contributions are threefold. First, we use radiance fields as a prior and supervision signal for optimizing point-based scene representations, leading to improved quality and more robust optimization. Next, we develop a novel pruning technique reducing the overall point count while maintaining high quality, leading to smaller and more compact scene representations with faster inference speeds. Finally, we propose a novel test-time filtering approach that further accelerates rendering and allows to scale to larger, house-sized scenes. We find that our method enables state-of-the-art synthesis of complex captures at 900+ FPS.
comment: Project page at https://m-niemeyer.github.io/radsplat/
☆ Learning from Models and Data for Visual Grounding
We introduce SynGround, a novel framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. The knowledge transfer from the models initiates the generation of image descriptions through an image description generator. These descriptions serve dual purposes: they act as prompts for synthesizing images through a text-to-image generator, and as queries for synthesizing text, from which phrases are extracted using a large language model. Finally, we leverage an open-vocabulary object detector to generate synthetic bounding boxes for the synthetic images and texts. We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention consistency objective that aligns region annotations with gradient-based model explanations. The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model. Particularly, SynGround improves the pointing game accuracy of ALBEF on the Flickr30k dataset from 79.38% to 87.26%, and on RefCOCO+ Test A from 69.35% to 79.06% and on RefCOCO+ Test B from 53.77% to 63.67%.
comment: Project Page: https://catherine-r-he.github.io/SynGround/
☆ Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments ICLR 2024
Bounding boxes uniquely characterize object detection, where a good detector gives accurate bounding boxes of categories of interest. However, in the real-world where test ground truths are not provided, it is non-trivial to find out whether bounding boxes are accurate, thus preventing us from assessing the detector generalization ability. In this work, we find under feature map dropout, good detectors tend to output bounding boxes whose locations do not change much, while bounding boxes of poor detectors will undergo noticeable position changes. We compute the box stability score (BoS score) to reflect this stability. Specifically, given an image, we compute a normal set of bounding boxes and a second set after feature map dropout. To obtain BoS score, we use bipartite matching to find the corresponding boxes between the two sets and compute the average Intersection over Union (IoU) across the entire test set. We contribute to finding that BoS score has a strong, positive correlation with detection accuracy measured by mean average precision (mAP) under various test environments. This relationship allows us to predict the accuracy of detectors on various real-world test sets without accessing test ground truths, verified on canonical detection tasks such as vehicle detection and pedestrian detection. Code and data are available at https://github.com/YangYangGirl/BoS.
comment: ICLR 2024 spotlight
☆ ZigMa: Zigzag Mamba Diffusion Model
The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$ and UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be released at https://taohu.me/zigma/
comment: Project Page: https://taohu.me/zigma/
☆ TimeRewind: Rewinding Time with Image-and-Events Video Diffusion
This paper addresses the novel challenge of ``rewinding'' time from a single captured image to recover the fleeting moments missed just before the shutter button is pressed. This problem poses a significant challenge in computer vision and computational photography, as it requires predicting plausible pre-capture motion from a single static frame, an inherently ill-posed task due to the high degree of freedom in potential pixel movements. We overcome this challenge by leveraging the emerging technology of neuromorphic event cameras, which capture motion information with high temporal resolution, and integrating this data with advanced image-to-video diffusion models. Our proposed framework introduces an event motion adaptor conditioned on event camera data, guiding the diffusion model to generate videos that are visually coherent and physically grounded in the captured events. Through extensive experimentation, we demonstrate the capability of our approach to synthesize high-quality videos that effectively ``rewind'' time, showcasing the potential of combining event camera technology with generative models. Our work opens new avenues for research at the intersection of computer vision, computational photography, and generative modeling, offering a forward-thinking solution to capturing missed moments and enhancing future consumer cameras and smartphones. Please see the project page at https://timerewind.github.io/ for video results and code release.
☆ Hierarchical NeuroSymbolic Approach for Action Quality Assessment
Action quality assessment (AQA) applies computer vision to quantitatively assess the performance or execution of a human action. Current AQA approaches are end-to-end neural models, which lack transparency and tend to be biased because they are trained on subjective human judgements as ground-truth. To address these issues, we introduce a neuro-symbolic paradigm for AQA, which uses neural networks to abstract interpretable symbols from video data and makes quality assessments by applying rules to those symbols. We take diving as the case study. We found that domain experts prefer our system and find it more informative than purely neural approaches to AQA in diving. Our system also achieves state-of-the-art action recognition and temporal segmentation, and automatically generates a detailed report that breaks the dive down into its elements and provides objective scoring with visual evidence. As verified by a group of domain experts, this report may be used to assist judges in scoring, help train judges, and provide feedback to divers. We will open-source all of our annotated training data and code for ease of reproducibility.
☆ Bridge the Modality and Capacity Gaps in Vision-Language Model Selection
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of Pre-Trained VLMs enhances the likelihood of identifying a suitable VLM for specific tasks. Thus, a promising zero-shot image classification strategy is selecting the most appropriate Pre-Trained VLM from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection: the "Modality Gap" -- the disparity in VLM's embeddings across two different modalities, making text a less reliable substitute for images; and the "Capability Gap" -- the discrepancy between the VLM's overall ranking and its ranking for target dataset, hindering direct prediction of a model's dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first adopts optimal transport to capture the relevance between open-source datasets and target dataset with a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from open-source datasets to the target dataset for bridging those two gaps and enhancing the VLM's capacity estimation for VLM selection. Experiments across various VLMs and image classification datasets validate SWAB's effectiveness.
☆ DepthFM: Fast Monocular Depth Estimation with Flow Matching
Monocular depth estimation is crucial for numerous downstream vision tasks and applications. Current discriminative approaches to this problem are limited due to blurry artifacts, while state-of-the-art generative methods suffer from slow sampling due to their SDE nature. Rather than starting from noise, we seek a direct mapping from input image to depth map. We observe that this can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality. Our study demonstrates that a pre-trained image diffusion model can serve as an adequate prior for a flow matching depth model, allowing efficient training on only synthetic data to generalize to real images. We find that an auxiliary surface normals loss further improves the depth estimates. Due to the generative nature of our approach, our model reliably predicts the confidence of its depth estimates. On standard benchmarks of complex natural scenes, our lightweight approach exhibits state-of-the-art performance at favorable low computational cost despite only being trained on little synthetic data.
☆ Certified Human Trajectory Prediction
Trajectory prediction plays an essential role in autonomous vehicles. While numerous strategies have been developed to enhance the robustness of trajectory prediction models, these methods are predominantly heuristic and do not offer guaranteed robustness against adversarial attacks and noisy observations. In this work, we propose a certification approach tailored for the task of trajectory prediction. To this end, we address the inherent challenges associated with trajectory prediction, including unbounded outputs, and mutli-modality, resulting in a model that provides guaranteed robustness. Furthermore, we integrate a denoiser into our method to further improve the performance. Through comprehensive evaluations, we demonstrate the effectiveness of the proposed technique across various baselines and using standard trajectory prediction datasets. The code will be made available online: https://s-attack.github.io/
☆ Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models
In this paper, we propose Describe-and-Dissect (DnD), a novel method to describe the roles of hidden neurons in vision networks. DnD utilizes recent advancements in multimodal deep learning to produce complex natural language descriptions, without the need for labeled training data or a predefined set of concepts to choose from. Additionally, DnD is training-free, meaning we don't train any new models and can easily leverage more capable general purpose models in the future. We have conducted extensive qualitative and quantitative analysis to show that DnD outperforms prior work by providing higher quality neuron descriptions. Specifically, our method on average provides the highest quality labels and is more than 2 times as likely to be selected as the best explanation for a neuron than the best baseline.
☆ Towards Principled Representation Learning from Videos for Reinforcement Learning ICLR 2024
We study pre-training representations for decision-making using video data, which is abundantly available for tasks such as game agents and software testing. Even though significant empirical advances have been made on this problem, a theoretical understanding remains absent. We initiate the theoretical investigation into principled approaches for representation learning and focus on learning the latent state representations of the underlying MDP using video data. We study two types of settings: one where there is iid noise in the observation, and a more challenging setting where there is also the presence of exogenous noise, which is non-iid noise that is temporally correlated, such as the motion of people or cars in the background. We study three commonly used approaches: autoencoding, temporal contrastive learning, and forward modeling. We prove upper bounds for temporal contrastive learning and forward modeling in the presence of only iid noise. We show that these approaches can learn the latent state and use it to do efficient downstream RL with polynomial sample complexity. When exogenous noise is also present, we establish a lower bound result showing that the sample complexity of learning from video data can be exponentially worse than learning from action-labeled trajectory data. This partially explains why reinforcement learning with video pre-training is hard. We evaluate these representational learning methods in two visual domains, yielding results that are consistent with our theoretical findings.
comment: ICLR 2024 Spotlight Conference Paper
☆ Practical End-to-End Optical Music Recognition for Pianoform Music
The majority of recent progress in Optical Music Recognition (OMR) has been achieved with Deep Learning methods, especially models following the end-to-end paradigm, reading input images and producing a linear sequence of tokens. Unfortunately, many music scores, especially piano music, cannot be easily converted to a linear sequence. This has led OMR researchers to use custom linearized encodings, instead of broadly accepted structured formats for music notation. Their diversity makes it difficult to compare the performance of OMR systems directly. To bring recent OMR model progress closer to useful results: (a) We define a sequential format called Linearized MusicXML, allowing to train an end-to-end model directly and maintaining close cohesion and compatibility with the industry-standard MusicXML format. (b) We create a dev and test set for benchmarking typeset OMR with MusicXML ground truth based on the OpenScore Lieder corpus. They contain 1,438 and 1,493 pianoform systems, each with an image from IMSLP. (c) We train and fine-tune an end-to-end model to serve as a baseline on the dataset and employ the TEDn metric to evaluate the model. We also test our model against the recently published synthetic pianoform dataset GrandStaff and surpass the state-of-the-art results.
comment: 15+4 pages, 6 figures
☆ HierCode: A Lightweight Hierarchical Codebook for Zero-shot Chinese Text Recognition
Text recognition, especially for complex scripts like Chinese, faces unique challenges due to its intricate character structures and vast vocabulary. Traditional one-hot encoding methods struggle with the representation of hierarchical radicals, recognition of Out-Of-Vocabulary (OOV) characters, and on-device deployment due to their computational intensity. To address these challenges, we propose HierCode, a novel and lightweight codebook that exploits the innate hierarchical nature of Chinese characters. HierCode employs a multi-hot encoding strategy, leveraging hierarchical binary tree encoding and prototype learning to create distinctive, informative representations for each character. This approach not only facilitates zero-shot recognition of OOV characters by utilizing shared radicals and structures but also excels in line-level recognition tasks by computing similarity with visual features, a notable advantage over existing methods. Extensive experiments across diverse benchmarks, including handwritten, scene, document, web, and ancient text, have showcased HierCode's superiority for both conventional and zero-shot Chinese character or text recognition, exhibiting state-of-the-art performance with significantly fewer parameters and fast inference speed.
☆ When Cars meet Drones: Hyperbolic Federated Learning for Source-Free Domain Adaptation in Adverse Weather
In Federated Learning (FL), multiple clients collaboratively train a global model without sharing private data. In semantic segmentation, the Federated source Free Domain Adaptation (FFreeDA) setting is of particular interest, where clients undergo unsupervised training after supervised pretraining at the server side. While few recent works address FL for autonomous vehicles, intrinsic real-world challenges such as the presence of adverse weather conditions and the existence of different autonomous agents are still unexplored. To bridge this gap, we address both problems and introduce a new federated semantic segmentation setting where both car and drone clients co-exist and collaborate. Specifically, we propose a novel approach for this setting which exploits a batch-norm weather-aware strategy to dynamically adapt the model to the different weather conditions, while hyperbolic space prototypes are used to align the heterogeneous client representations. Finally, we introduce FLYAWARE, the first semantic segmentation dataset with adverse weather data for aerial vehicles.
☆ Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model
We present a knowledge augmentation strategy for assessing the diagnostic groups and gait impairment from monocular gait videos. Based on a large-scale pre-trained Vision Language Model (VLM), our model learns and improves visual, textual, and numerical representations of patient gait videos, through a collective learning across three distinct modalities: gait videos, class-specific descriptions, and numerical gait parameters. Our specific contributions are two-fold: First, we adopt a knowledge-aware prompt tuning strategy to utilize the class-specific medical description in guiding the text prompt learning. Second, we integrate the paired gait parameters in the form of numerical texts to enhance the numeracy of the textual representation. Results demonstrate that our model not only significantly outperforms state-of-the-art (SOTA) in video-based classification tasks but also adeptly decodes the learned class-specific text features into natural language descriptions using the vocabulary of quantitative gait parameters. The code and the model will be made available at our project page.
☆ Leveraging High-Resolution Features for Improved Deep Hashing-based Image Retrieval
Deep hashing techniques have emerged as the predominant approach for efficient image retrieval. Traditionally, these methods utilize pre-trained convolutional neural networks (CNNs) such as AlexNet and VGG-16 as feature extractors. However, the increasing complexity of datasets poses challenges for these backbone architectures in capturing meaningful features essential for effective image retrieval. In this study, we explore the efficacy of employing high-resolution features learned through state-of-the-art techniques for image retrieval tasks. Specifically, we propose a novel methodology that utilizes High-Resolution Networks (HRNets) as the backbone for the deep hashing task, termed High-Resolution Hashing Network (HHNet). Our approach demonstrates superior performance compared to existing methods across all tested benchmark datasets, including CIFAR-10, NUS-WIDE, MS COCO, and ImageNet. This performance improvement is more pronounced for complex datasets, which highlights the need to learn high-resolution features for intricate image retrieval tasks. Furthermore, we conduct a comprehensive analysis of different HRNet configurations and provide insights into the optimal architecture for the deep hashing task
☆ Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation
Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. We introduce MOTIA Mastering Video Outpainting Through Input-Specific Adaptation, a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting. The input-specific adaptation phase involves conducting efficient and effective pseudo outpainting learning on the single-shot source video. This process encourages the model to identify and learn patterns within the source video, as well as bridging the gap between standard generative processes and outpainting. The subsequent phase, pattern-aware outpainting, is dedicated to the generalization of these learned patterns to generate outpainting outcomes. Additional strategies including spatial-aware insertion and noise travel are proposed to better leverage the diffusion model's generative prior and the acquired video patterns from source videos. Extensive evaluations underscore MOTIA's superiority, outperforming existing state-of-the-art methods in widely recognized benchmarks. Notably, these advancements are achieved without necessitating extensive, task-specific tuning.
comment: Code will be available at https://github.com/G-U-N/Be-Your-Outpainter
☆ DBA-Fusion: Tightly Integrating Deep Dense Visual Bundle Adjustment with Multiple Sensors for Large-Scale Localization and Mapping
Visual simultaneous localization and mapping (VSLAM) has broad applications, with state-of-the-art methods leveraging deep neural networks for better robustness and applicability. However, there is a lack of research in fusing these learning-based methods with multi-sensor information, which could be indispensable to push related applications to large-scale and complex scenarios. In this paper, we tightly integrate the trainable deep dense bundle adjustment (DBA) with multi-sensor information through a factor graph. In the framework, recurrent optical flow and DBA are performed among sequential images. The Hessian information derived from DBA is fed into a generic factor graph for multi-sensor fusion, which employs a sliding window and supports probabilistic marginalization. A pipeline for visual-inertial integration is firstly developed, which provides the minimum ability of metric-scale localization and mapping. Furthermore, other sensors (e.g., global navigation satellite system) are integrated for driftless and geo-referencing functionality. Extensive tests are conducted on both public datasets and self-collected datasets. The results validate the superior localization performance of our approach, which enables real-time dense mapping in large-scale environments. The code has been made open-source (https://github.com/GREAT-WHU/DBA-Fusion).
☆ Fostc3net:A Lightweight YOLOv5 Based On the Network Structure Optimization
Transmission line detection technology is crucial for automatic monitoring and ensuring the safety of electrical facilities. The YOLOv5 series is currently one of the most advanced and widely used methods for object detection. However, it faces inherent challenges, such as high computational load on devices and insufficient detection accuracy. To address these concerns, this paper presents an enhanced lightweight YOLOv5 technique customized for mobile devices, specifically intended for identifying objects associated with transmission lines. The C3Ghost module is integrated into the convolutional network of YOLOv5 to reduce floating point operations per second (FLOPs) in the feature channel fusion process and improve feature expression performance. In addition, a FasterNet module is introduced to replace the c3 module in the YOLOv5 Backbone. The FasterNet module uses Partial Convolutions to process only a portion of the input channels, improving feature extraction efficiency and reducing computational overhead. To address the imbalance between simple and challenging samples in the dataset and the diversity of aspect ratios of bounding boxes, the wIoU v3 LOSS is adopted as the loss function. To validate the performance of the proposed approach, Experiments are conducted on a custom dataset of transmission line poles. The results show that the proposed model achieves a 1% increase in detection accuracy, a 13% reduction in FLOPs, and a 26% decrease in model parameters compared to the existing YOLOv5.In the ablation experiment, it was also discovered that while the Fastnet module and the CSghost module improved the precision of the original YOLOv5 baseline model, they caused a decrease in the mAP@.5-.95 metric. However, the improvement of the wIoUv3 loss function significantly mitigated the decline of the mAP@.5-.95 metric.
☆ Insight Into the Collocation of Multi-Source Satellite Imagery for Multi-Scale Vessel Detection
Ship detection from satellite imagery using Deep Learning (DL) is an indispensable solution for maritime surveillance. However, applying DL models trained on one dataset to others having differences in spatial resolution and radiometric features requires many adjustments. To overcome this issue, this paper focused on the DL models trained on datasets that consist of different optical images and a combination of radar and optical data. When dealing with a limited number of training images, the performance of DL models via this approach was satisfactory. They could improve 5-20% of average precision, depending on the optical images tested. Likewise, DL models trained on the combined optical and radar dataset could be applied to both optical and radar images. Our experiments showed that the models trained on an optical dataset could be used for radar images, while those trained on a radar dataset offered very poor scores when applied to optical images.
comment: 5 pages, accepted to IGARSS 2024
☆ MotorEase: Automated Detection of Motor Impairment Accessibility Issues in Mobile App UIs ICSE 2024
Recent research has begun to examine the potential of automatically finding and fixing accessibility issues that manifest in software. However, while recent work makes important progress, it has generally been skewed toward identifying issues that affect users with certain disabilities, such as those with visual or hearing impairments. However, there are other groups of users with different types of disabilities that also need software tooling support to improve their experience. As such, this paper aims to automatically identify accessibility issues that affect users with motor-impairments. To move toward this goal, this paper introduces a novel approach, called MotorEase, capable of identifying accessibility issues in mobile app UIs that impact motor-impaired users. Motor-impaired users often have limited ability to interact with touch-based devices, and instead may make use of a switch or other assistive mechanism -- hence UIs must be designed to support both limited touch gestures and the use of assistive devices. MotorEase adapts computer vision and text processing techniques to enable a semantic understanding of app UI screens, enabling the detection of violations related to four popular, previously unexplored UI design guidelines that support motor-impaired users, including: (i) visual touch target size, (ii) expanding sections, (iii) persisting elements, and (iv) adjacent icon visual distance. We evaluate MotorEase on a newly derived benchmark, called MotorCheck, that contains 555 manually annotated examples of violations to the above accessibility guidelines, across 1599 screens collected from 70 applications via a mobile app testing tool. Our experiments illustrate that MotorEase is able to identify violations with an average accuracy of ~90%, and a false positive rate of less than 9%, outperforming baseline techniques.
comment: Accepted to ICSE 2024 Research Track, 13 pages
☆ SPTNet: An Efficient Alternative Framework for Generalized Category Discovery with Spatial Prompt Tuning ICLR 2024
Generalized Category Discovery (GCD) aims to classify unlabelled images from both `seen' and `unseen' classes by transferring knowledge from a set of labelled `seen' class images. A key theme in existing GCD approaches is adapting large-scale pre-trained models for the GCD task. An alternate perspective, however, is to adapt the data representation itself for better alignment with the pre-trained model. As such, in this paper, we introduce a two-stage adaptation approach termed SPTNet, which iteratively optimizes model parameters (i.e., model-finetuning) and data parameters (i.e., prompt learning). Furthermore, we propose a novel spatial prompt tuning method (SPT) which considers the spatial property of image data, enabling the method to better focus on object parts, which can transfer between seen and unseen classes. We thoroughly evaluate our SPTNet on standard benchmarks and demonstrate that our method outperforms existing GCD methods. Notably, we find our method achieves an average accuracy of 61.4% on the SSB, surpassing prior state-of-the-art methods by approximately 10%. The improvement is particularly remarkable as our method yields extra parameters amounting to only 0.117% of those in the backbone architecture. Project page: https://visual-ai.github.io/sptnet.
comment: Accepted as a conference paper at ICLR 2024; Project page: https://visual-ai.github.io/sptnet
☆ DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses CVPR 2024
Determining the relative pose of an object between two images is pivotal to the success of generalizable object pose estimation. Existing approaches typically approximate the continuous pose representation with a large number of discrete pose hypotheses, which incurs a computationally expensive process of scoring each hypothesis at test time. By contrast, we present a Deep Voxel Matching Network (DVMNet) that eliminates the need for pose hypotheses and computes the relative object pose in a single pass. To this end, we map the two input RGB images, reference and query, to their respective voxelized 3D representations. We then pass the resulting voxels through a pose estimation module, where the voxels are aligned and the pose is computed in an end-to-end fashion by solving a least-squares problem. To enhance robustness, we introduce a weighted closest voxel algorithm capable of mitigating the impact of noisy voxels. We conduct extensive experiments on the CO3D, LINEMOD, and Objaverse datasets, demonstrating that our method delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods. Our code is released at: https://github.com/sailor-z/DVMNet/.
comment: Accepted by CVPR 2024
☆ Step-Calibrated Diffusion for Biomedical Optical Image Restoration
High-quality, high-resolution medical imaging is essential for clinical care. Raman-based biomedical optical imaging uses non-ionizing infrared radiation to evaluate human tissues in real time and is used for early cancer detection, brain tumor diagnosis, and intraoperative tissue analysis. Unfortunately, optical imaging is vulnerable to image degradation due to laser scattering and absorption, which can result in diagnostic errors and misguided treatment. Restoration of optical images is a challenging computer vision task because the sources of image degradation are multi-factorial, stochastic, and tissue-dependent, preventing a straightforward method to obtain paired low-quality/high-quality data. Here, we present Restorative Step-Calibrated Diffusion (RSCD), an unpaired image restoration method that views the image restoration problem as completing the finishing steps of a diffusion-based image generation task. RSCD uses a step calibrator model to dynamically determine the severity of image degradation and the number of steps required to complete the reverse diffusion process for image restoration. RSCD outperforms other widely used unpaired image restoration methods on both image quality and perceptual evaluation metrics for restoring optical images. Medical imaging experts consistently prefer images restored using RSCD in blinded comparison experiments and report minimal to no hallucinations. Finally, we show that RSCD improves performance on downstream clinical imaging tasks, including automated brain tumor diagnosis and deep tissue imaging. Our code is available at https://github.com/MLNeurosurg/restorative_step-calibrated_diffusion.
☆ AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts
Leveraging the synergy of both audio data and visual data is essential for understanding human emotions and behaviors, especially in in-the-wild setting. Traditional methods for integrating such multimodal information often stumble, leading to less-than-ideal outcomes in the task of facial action unit detection. To overcome these shortcomings, we propose a novel approach utilizing audio-visual multimodal data. This method enhances audio feature extraction by leveraging Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features alongside a pre-trained VGGish network. Moreover, this paper adaptively captures fusion features across modalities by modeling the temporal relationships, and ultilizes a pre-trained GPT-2 model for sophisticated context-aware fusion of multimodal information. Our method notably improves the accuracy of AU detection by understanding the temporal and contextual nuances of the data, showcasing significant advancements in the comprehension of intricate scenarios. These findings underscore the potential of integrating temporal dynamics and contextual interpretation, paving the way for future research endeavors.
☆ Retina Vision Transformer (RetinaViT): Introducing Scaled Patches into Vision Transformers
Humans see low and high spatial frequency components at the same time, and combine the information from both to form a visual scene. Drawing on this neuroscientific inspiration, we propose an altered Vision Transformer architecture where patches from scaled down versions of the input image are added to the input of the first Transformer Encoder layer. We name this model Retina Vision Transformer (RetinaViT) due to its inspiration from the human visual system. Our experiments show that when trained on the ImageNet-1K dataset with a moderate configuration, RetinaViT achieves a 3.3% performance improvement over the original ViT. We hypothesize that this improvement can be attributed to the inclusion of low spatial frequency components in the input, which improves the ability to capture structural features, and to select and forward important features to deeper layers. RetinaViT thereby opens doors to further investigations into vertical pathways and attention patterns.
☆ DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance CVPR 2024
Choreographers determine what the dances look like, while cameramen determine the final presentation of dances. Recently, various methods and datasets have showcased the feasibility of dance synthesis. However, camera movement synthesis with music and dance remains an unsolved challenging problem due to the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D dataset, which for the first time combines camera movement with dance motion and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of paired dance-camera-music data from the anime community, covering 4 music genres. With this dataset, we uncover that dance camera movement is multifaceted and human-centric, and possesses multiple influencing factors, making dance camera synthesis a more challenging task compared to camera or dance synthesis alone. To overcome these difficulties, we propose DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy. For evaluation, we devise new metrics measuring camera movement quality, diversity, and dancer fidelity. Utilizing these metrics, we conduct extensive experiments on our DCM dataset, providing both quantitative and qualitative evidence showcasing the effectiveness of our DanceCamera3D model. Code and video demos are available at https://github.com/Carmenw1203/DanceCamera3D-Official.
comment: Accept to CVPR 2024
☆ T-Pixel2Mesh: Combining Global and Local Transformer for 3D Mesh Generation from a Single Image ICASSP 2024
Pixel2Mesh (P2M) is a classical approach for reconstructing 3D shapes from a single color image through coarse-to-fine mesh deformation. Although P2M is capable of generating plausible global shapes, its Graph Convolution Network (GCN) often produces overly smooth results, causing the loss of fine-grained geometry details. Moreover, P2M generates non-credible features for occluded regions and struggles with the domain gap from synthetic data to real-world images, which is a common challenge for single-view 3D reconstruction methods. To address these challenges, we propose a novel Transformer-boosted architecture, named T-Pixel2Mesh, inspired by the coarse-to-fine approach of P2M. Specifically, we use a global Transformer to control the holistic shape and a local Transformer to progressively refine the local geometry details with graph-based point upsampling. To enhance real-world reconstruction, we present the simple yet effective Linear Scale Search (LSS), which serves as prompt tuning during the input preprocessing. Our experiments on ShapeNet demonstrate state-of-the-art performance, while results on real-world data show the generalization capability.
comment: Received by ICASSP 2024
☆ ProMamba: Prompt-Mamba for polyp segmentation
Detecting polyps through colonoscopy is an important task in medical image segmentation, which provides significant assistance and reference value for clinical surgery. However, accurate segmentation of polyps is a challenging task due to two main reasons. Firstly, polyps exhibit various shapes and colors. Secondly, the boundaries between polyps and their normal surroundings are often unclear. Additionally, significant differences between different datasets lead to limited generalization capabilities of existing methods. To address these issues, we propose a segmentation model based on Prompt-Mamba, which incorporates the latest Vision-Mamba and prompt technologies. Compared to previous models trained on the same dataset, our model not only maintains high segmentation accuracy on the validation part of the same dataset but also demonstrates superior accuracy on unseen datasets, exhibiting excellent generalization capabilities. Notably, we are the first to apply the Vision-Mamba architecture to polyp segmentation and the first to utilize prompt technology in a polyp segmentation model. Our model efficiently accomplishes segmentation tasks, surpassing previous state-of-the-art methods by an average of 5% across six datasets. Furthermore, we have developed multiple versions of our model with scaled parameter counts, achieving better performance than previous models even with fewer parameters. Our code and trained weights will be released soon.
comment: 10 pages, 2 figures,3 tabels
☆ Recursive Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition
Multi-modal emotion recognition has recently gained a lot of attention since it can leverage diverse and complementary relationships over multiple modalities, such as audio, visual, and text. Most state-of-the-art methods for multimodal fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of the modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial, vocal, and text modalities extracted from videos. Specifically, we propose a recursive cross-modal attention (RCMA) to effectively capture the complementary relationships across the modalities in a recursive fashion. The proposed model is able to effectively capture the inter-modal relationships by computing the cross-attention weights across the individual modalities and the joint representation of the other two modalities. To further improve the inter-modal relationships, the obtained attended features of the individual modalities are again fed as input to the cross-modal attention to refine the feature representations of the individual modalities. In addition to that, we have used Temporal convolution networks (TCNs) to capture the temporal modeling (intra-modal relationships) of the individual modalities. By deploying the TCNs as well cross-modal attention in a recursive fashion, we are able to effectively capture both intra- and inter-modal relationships across the audio, visual, and text modalities. Experimental results on validation-set videos from the AffWild2 dataset indicate that our proposed fusion model is able to achieve significant improvement over the baseline for the sixth challenge of Affective Behavior Analysis in-the-Wild 2024 (ABAW6) competition.
comment: arXiv admin note: substantial text overlap with arXiv:2209.09068; text overlap with arXiv:2203.14779 by other authors
☆ Multimodal Variational Autoencoder for Low-cost Cardiac Hemodynamics Instability Detection
Recent advancements in non-invasive detection of cardiac hemodynamic instability (CHDI) primarily focus on applying machine learning techniques to a single data modality, e.g. cardiac magnetic resonance imaging (MRI). Despite their potential, these approaches often fall short especially when the size of labeled patient data is limited, a common challenge in the medical domain. Furthermore, only a few studies have explored multimodal methods to study CHDI, which mostly rely on costly modalities such as cardiac MRI and echocardiogram. In response to these limitations, we propose a novel multimodal variational autoencoder ($\text{CardioVAE}_\text{X,G}$) to integrate low-cost chest X-ray (CXR) and electrocardiogram (ECG) modalities with pre-training on a large unlabeled dataset. Specifically, $\text{CardioVAE}_\text{X,G}$ introduces a novel tri-stream pre-training strategy to learn both shared and modality-specific features, thus enabling fine-tuning with both unimodal and multimodal datasets. We pre-train $\text{CardioVAE}_\text{X,G}$ on a large, unlabeled dataset of $50,982$ subjects from a subset of MIMIC database and then fine-tune the pre-trained model on a labeled dataset of $795$ subjects from the ASPIRE registry. Comprehensive evaluations against existing methods show that $\text{CardioVAE}_\text{X,G}$ offers promising performance (AUROC $=0.79$ and Accuracy $=0.77$), representing a significant step forward in non-invasive prediction of CHDI. Our model also excels in producing fine interpretations of predictions directly associated with clinical features, thereby supporting clinical decision-making.
☆ Learning User Embeddings from Human Gaze for Personalised Saliency Prediction
Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two public saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model's ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.
☆ ZoDi: Zero-Shot Domain Adaptation with Diffusion-Based Image Transfer
Deep learning models achieve high accuracy in segmentation tasks among others, yet domain shift often degrades the models' performance, which can be critical in real-world scenarios where no target images are available. This paper proposes a zero-shot domain adaptation method based on diffusion models, called ZoDi, which is two-fold by the design: zero-shot image transfer and model adaptation. First, we utilize an off-the-shelf diffusion model to synthesize target-like images by transferring the domain of source images to the target domain. In this we specifically try to maintain the layout and content by utilising layout-to-image diffusion models with stochastic inversion. Secondly, we train the model using both source images and synthesized images with the original segmentation maps while maximizing the feature similarity of images from the two domains to learn domain-robust representations. Through experiments we show benefits of ZoDi in the task of image segmentation over state-of-the-art methods. It is also more applicable than existing CLIP-based methods because it assumes no specific backbone or models, and it enables to estimate the model's performance without target images by inspecting generated images. Our implementation will be publicly available.
☆ Meta-Point Learning and Refining for Category-Agnostic Pose Estimation CVPR 2024
Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary classes given a few support images annotated with keypoints. Existing methods only rely on the features extracted at support keypoints to predict or refine the keypoints on query image, but a few support feature vectors are local and inadequate for CAPE. Considering that human can quickly perceive potential keypoints of arbitrary objects, we propose a novel framework for CAPE based on such potential keypoints (named as meta-points). Specifically, we maintain learnable embeddings to capture inherent information of various keypoints, which interact with image feature maps to produce meta-points without any support. The produced meta-points could serve as meaningful potential keypoints for CAPE. Due to the inevitable gap between inherency and annotation, we finally utilize the identities and details offered by support keypoints to assign and refine meta-points to desired keypoints in query image. In addition, we propose a progressive deformable point decoder and a slacked regression loss for better prediction and supervision. Our novel framework not only reveals the inherency of keypoints but also outperforms existing methods of CAPE. Comprehensive experiments and in-depth studies on large-scale MP-100 dataset demonstrate the effectiveness of our framework.
comment: Published in CVPR 2024
☆ H-vmunet: High-order Vision Mamba UNet for Medical Image Segmentation
In the field of medical image segmentation, variant models based on Convolutional Neural Networks (CNNs) and Visual Transformers (ViTs) as the base modules have been very widely developed and applied. However, CNNs are often limited in their ability to deal with long sequences of information, while the low sensitivity of ViTs to local feature information and the problem of secondary computational complexity limit their development. Recently, the emergence of state-space models (SSMs), especially 2D-selective-scan (SS2D), has had an impact on the longtime dominance of traditional CNNs and ViTs as the foundational modules of visual neural networks. In this paper, we extend the adaptability of SS2D by proposing a High-order Vision Mamba UNet (H-vmunet) for medical image segmentation. Among them, the proposed High-order 2D-selective-scan (H-SS2D) progressively reduces the introduction of redundant information during SS2D operations through higher-order interactions. In addition, the proposed Local-SS2D module improves the learning ability of local features of SS2D at each order of interaction. We conducted comparison and ablation experiments on three publicly available medical image datasets (ISIC2017, Spleen, and CVC-ClinicDB), and the results all demonstrate the strong competitiveness of H-vmunet in medical image segmentation tasks. The code is available from https://github.com/wurenkai/H-vmunet .
☆ VL-Mamba: Exploring State Space Models for Multimodal Learning
Multimodal large language models (MLLMs) have attracted widespread interest and have rich applications. However, the inherent attention mechanism in its Transformer structure requires quadratic complexity and results in expensive computational overhead. Therefore, in this work, we propose VL-Mamba, a multimodal large language model based on state space models, which have been shown to have great potential for long-sequence modeling with fast inference and linear scaling in sequence length. Specifically, we first replace the transformer-based backbone language model such as LLama or Vicuna with the pre-trained Mamba language model. Then, we empirically explore how to effectively apply the 2D vision selective scan mechanism for multimodal learning and the combinations of different vision encoders and variants of pretrained Mamba language models. The extensive experiments on diverse multimodal benchmarks with competitive performance show the effectiveness of our proposed VL-Mamba and demonstrate the great potential of applying state space models for multimodal learning tasks.
☆ ReGround: Improving Textual and Spatial Grounding at No Cost
When an image generation process is guided by both a text prompt and spatial cues, such as a set of bounding boxes, do these elements work in harmony, or does one dominate the other? Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding due to the sequential flow from gated self-attention to cross-attention. We demonstrate that such bias can be significantly mitigated without sacrificing accuracy in either grounding by simply rewiring the network architecture, changing from sequential to parallel for gated self-attention and cross-attention. This surprisingly simple yet effective solution does not require any fine-tuning of the network but significantly reduces the trade-off between the two groundings. Our experiments demonstrate significant improvements from the original GLIGEN to the rewired version in the trade-off between textual grounding and spatial grounding.
comment: Project page: https://re-ground.github.io/
☆ Leveraging feature communication in federated learning for remote sensing image classification
In the realm of Federated Learning (FL) applied to remote sensing image classification, this study introduces and assesses several innovative communication strategies. Our exploration includes feature-centric communication, pseudo-weight amalgamation, and a combined method utilizing both weights and features. Experiments conducted on two public scene classification datasets unveil the effectiveness of these strategies, showcasing accelerated convergence, heightened privacy, and reduced network information exchange. This research provides valuable insights into the implications of feature-centric communication in FL, offering potential applications tailored for remote sensing scenarios.
comment: 5 pages, to appear in IGARSS 2024
☆ Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer
In this paper, we propose a novel learning approach for feed-forward one-shot 4D head avatar synthesis. Different from existing methods that often learn from reconstructing monocular videos guided by 3DMM, we employ pseudo multi-view videos to learn a 4D head synthesizer in a data-driven manner, avoiding reliance on inaccurate 3DMM reconstruction that could be detrimental to the synthesis performance. The key idea is to first learn a 3D head synthesizer using synthetic multi-view images to convert monocular real videos into multi-view ones, and then utilize the pseudo multi-view videos to learn a 4D head synthesizer via cross-view self-reenactment. By leveraging a simple vision transformer backbone with motion-aware cross-attentions, our method exhibits superior performance compared to previous methods in terms of reconstruction fidelity, geometry consistency, and motion control accuracy. We hope our method offers novel insights into integrating 3D priors with 2D supervisions for improved 4D head avatar creation.
comment: Project page: https://yudeng.github.io/Portrait4D-v2/
☆ Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments
In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high costs associated with annotating new object classes. Our exploration of open-vocabulary (OV) learning in urban environments aims to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark a set of four potential solutions as baselines, categorizing them into either top-down or bottom-up approaches based on their input data strategies. While effective, these methods exhibit certain limitations, such as missing novel objects in 3D box estimation or applying rigorous priors, leading to biases towards objects near the camera or of rectangular geometries. To overcome these limitations, we introduce a universal \textsc{Find n' Propagate} approach for 3D OV tasks, aimed at maximizing the recall of novel objects and propagating this detection capability to more distant areas thereby progressively capturing more. In particular, we utilize a greedy box seeker to search against 3D novel boxes of varying orientations and depth in each generated frustum and ensure the reliability of newly identified boxes by cross alignment and density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances in the self-training process, combined with the fusion of base samples in the memory bank. Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is made available in the supplementary material.
☆ Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing
Despite recent advancements in text-to-image diffusion models facilitating various image editing techniques, complex text prompts often lead to an oversight of some requests due to a bottleneck in processing text information. To tackle this challenge, we present Ground-A-Score, a simple yet powerful model-agnostic image editing method by incorporating grounding during score distillation. This approach ensures a precise reflection of intricate prompt requirements in the editing outcomes, taking into account the prior knowledge of the object locations within the image. Moreover, the selective application with a new penalty coefficient and contrastive loss helps to precisely target editing areas while preserving the integrity of the objects in the source image. Both qualitative assessments and quantitative analyses confirm that Ground-A-Score successfully adheres to the intricate details of extended and multifaceted prompts, ensuring high-quality outcomes that respect the original image attributes.
☆ Diversity-aware Channel Pruning for StyleGAN Compression CVPR 2024
StyleGAN has shown remarkable performance in unconditional image generation. However, its high computational cost poses a significant challenge for practical applications. Although recent efforts have been made to compress StyleGAN while preserving its performance, existing compressed models still lag behind the original model, particularly in terms of sample diversity. To overcome this, we propose a novel channel pruning method that leverages varying sensitivities of channels to latent vectors, which is a key factor in sample diversity. Specifically, by assessing channel importance based on their sensitivities to latent vector perturbations, our method enhances the diversity of samples in the compressed model. Since our method solely focuses on the channel pruning stage, it has complementary benefits with prior training schemes without additional training cost. Extensive experiments demonstrate that our method significantly enhances sample diversity across various datasets. Moreover, in terms of FID scores, our method not only surpasses state-of-the-art by a large margin but also achieves comparable scores with only half training iterations.
comment: Accepted to CVPR 2024. Project page: https://jiwoogit.github.io/DCP-GAN_site
☆ Next day fire prediction via semantic segmentation ACL
In this paper we present a deep learning pipeline for next day fire prediction. The next day fire prediction task consists in learning models that receive as input the available information for an area up until a certain day, in order to predict the occurrence of fire for the next day. Starting from our previous problem formulation as a binary classification task on instances (daily snapshots of each area) represented by tabular feature vectors, we reformulate the problem as a semantic segmentation task on images; there, each pixel corresponds to a daily snapshot of an area, while its channels represent the formerly tabular training features. We demonstrate that this problem formulation, built within a thorough pipeline achieves state of the art results.
comment: Accepted in MACLEAN@ECML/PKDD 2023
☆ What explains the success of cross-modal fine-tuning with ORCA?
ORCA (Shen et al., 2023) is a recent technique for cross-modal fine-tuning, i.e., applying pre-trained transformer models to modalities beyond their training data. The technique consists primarily of training an embedder and fine-tuning the embedder and model. Despite its high performance on a variety of downstream tasks, we do not understand precisely how each of these components contribute to ORCA's success. Therefore, we run a series of ablations and find that embedder training does not help 2D tasks at all, contrary to what the original paper posits. In 1D tasks, some amount of embedder training is necessary but more is not better. In 4 out of 6 datasets we experiment with, it is model fine-tuning that makes the biggest difference. Through our ablations and baselines, we contribute a better understanding of the individual components of ORCA.
☆ IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models
Leveraging Stable Diffusion for the generation of personalized portraits has emerged as a powerful and noteworthy tool, enabling users to create high-fidelity, custom character avatars based on their specific prompts. However, existing personalization methods face challenges, including test-time fine-tuning, the requirement of multiple input images, low preservation of identity, and limited diversity in generated outcomes. To overcome these challenges, we introduce IDAdapter, a tuning-free approach that enhances the diversity and identity preservation in personalized image generation from a single face image. IDAdapter integrates a personalized concept into the generation process through a combination of textual and visual injections and a face identity loss. During the training phase, we incorporate mixed features from multiple reference images of a specific identity to enrich identity-related content details, guiding the model to generate images with more diverse styles, expressions, and angles compared to previous works. Extensive evaluations demonstrate the effectiveness of our method, achieving both diversity and identity fidelity in generated images.
comment: 14 pages, 15 figures
☆ Compress3D: a Compressed Latent Space for 3D Generation from a Single Image
3D generation has witnessed significant advancements, yet efficiently producing high-quality 3D assets from a single image remains challenging. In this paper, we present a triplane autoencoder, which encodes 3D models into a compact triplane latent space to effectively compress both the 3D geometry and texture information. Within the autoencoder framework, we introduce a 3D-aware cross-attention mechanism, which utilizes low-resolution latent representations to query features from a high-resolution 3D feature volume, thereby enhancing the representation capacity of the latent space. Subsequently, we train a diffusion model on this refined latent space. In contrast to solely relying on image embedding for 3D generation, our proposed method advocates for the simultaneous utilization of both image embedding and shape embedding as conditions. Specifically, the shape embedding is estimated via a diffusion prior model conditioned on the image embedding. Through comprehensive experiments, we demonstrate that our method outperforms state-of-the-art algorithms, achieving superior performance while requiring less training data and time. Our approach enables the generation of high-quality 3D assets in merely 7 seconds on a single A100 GPU.
☆ REAL: Representation Enhanced Analytic Learning for Exemplar-free Class-incremental Learning
Exemplar-free class-incremental learning (EFCIL) aims to mitigate catastrophic forgetting in class-incremental learning without available historical data. Compared with its counterpart (replay-based CIL) that stores historical samples, the EFCIL suffers more from forgetting issues under the exemplar-free constraint. In this paper, inspired by the recently developed analytic learning (AL) based CIL, we propose a representation enhanced analytic learning (REAL) for EFCIL. The REAL constructs a dual-stream base pretraining (DS-BPT) and a representation enhancing distillation (RED) process to enhance the representation of the extractor. The DS-BPT pretrains model in streams of both supervised learning and self-supervised contrastive learning (SSCL) for base knowledge extraction. The RED process distills the supervised knowledge to the SSCL pretrained backbone and facilitates a subsequent AL-basd CIL that converts the CIL to a recursive least-square problem. Our method addresses the issue of insufficient discriminability in representations of unseen data caused by a frozen backbone in the existing AL-based CIL. Empirical results on various datasets including CIFAR-100, ImageNet-100 and ImageNet-1k, demonstrate that our REAL outperforms the state-of-the-arts in EFCIL, and achieves comparable or even more superior performance compared with the replay-based methods.
☆ Motion Generation from Fine-grained Textual Descriptions
The task of text2motion is to generate motion sequences from given textual descriptions, where a model should explore the interactions between natural language instructions and human body movements. While most existing works are confined to coarse-grained motion descriptions (e.g., "A man squats."), fine-grained ones specifying movements of relevant body parts are barely explored. Models trained with coarse texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure in generating motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset with fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with delicate prompts. Accordingly, we design a new text2motion model, FineMotionDiffuse, which makes full use of fine-grained textual information. Our experiments show that FineMotionDiffuse trained on FineHumanML3D acquires good results in quantitative evaluation. We also find this model can better generate spatially/chronologically composite motions by learning the implicit mappings from simple descriptions to the corresponding basic motions.
☆ What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models
This paper presents a way of enhancing the reliability of Large Multimodal Models (LMMs) in addressing hallucination effects, where models generate incorrect or unrelated responses. Without additional instruction tuning paradigm, we introduce Counterfactual Inception, a novel method that implants counterfactual thoughts into LMMs using carefully chosen, misaligned counterfactual keywords. This method is grounded in the concept of counterfactual thinking, a cognitive process where humans consider alternative realities and outcomes. By applying this human-like reasoning mechanism to LMMs, we aim to reduce hallucination effects and improve the models' trustworthiness. We also propose Dual-modality Verification Process (DVP), a rigorous framework for selecting optimal counterfactual keywords to trigger counterfactual thinking into LMMs, concurrently considering visual and linguistic context. Our extensive experiments across various LMMs, including both open-source and proprietary models, corroborate that our method significantly mitigates hallucination phenomena across different datasets.
comment: under review, code available: https://github.com/IVY-LVLM/Counterfactual-Inception
☆ Scale Decoupled Distillation CVPR2024
Logit knowledge distillation attracts increasing attention due to its practicality in recent studies. However, it often suffers inferior performance compared to the feature knowledge distillation. In this paper, we argue that existing logit-based methods may be sub-optimal since they only leverage the global logit output that couples multiple semantic knowledge. This may transfer ambiguous knowledge to the student and mislead its learning. To this end, we propose a simple but effective method, i.e., Scale Decoupled Distillation (SDD), for logit knowledge distillation. SDD decouples the global logit output into multiple local logit outputs and establishes distillation pipelines for them. This helps the student to mine and inherit fine-grained and unambiguous logit knowledge. Moreover, the decoupled knowledge can be further divided into consistent and complementary logit knowledge that transfers the semantic information and sample ambiguity, respectively. By increasing the weight of complementary parts, SDD can guide the student to focus more on ambiguous samples, improving its discrimination ability. Extensive experiments on several benchmark datasets demonstrate the effectiveness of SDD for wide teacher-student pairs, especially in the fine-grained classification task. Code is available at: https://github.com/shicaiwei123/SDD-CVPR2024
comment: Accepted to CVPR2024 10 pages 6figure
☆ High-confidence pseudo-labels for domain adaptation in COVID-19 detection
This paper outlines our submission for the 4th COV19D competition as part of the `Domain adaptation, Explainability, Fairness in AI for Medical Image Analysis' (DEF-AI-MIA) workshop at the Computer Vision and Pattern Recognition Conference (CVPR). The competition consists of two challenges. The first is to train a classifier to detect the presence of COVID-19 from over one thousand CT scans from the COV19-CT-DB database. The second challenge is to perform domain adaptation by taking the dataset from Challenge 1 and adding a small number of scans (some annotated and other not) for a different distribution. We preprocessed the CT scans to segment the lungs, and output volumes with the lungs individually and together. We then trained 3D ResNet and Swin Transformer models on these inputs. We annotated the unlabeled CT scans using an ensemble of these models and chose the high-confidence predictions as pseudo-labels for fine-tuning. This resulted in a best cross-validation mean F1 score of 93.39\% for Challenge 1 and a mean F1 score of 92.15 for Challenge 2.
☆ FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs
Despite the remarkable performance of video-based large language models (LLMs), their adversarial threat remains unexplored. To fill this gap, we propose the first adversarial attack tailored for video-based LLMs by crafting flow-based multi-modal adversarial perturbations on a small fraction of frames within a video, dubbed FMM-Attack. Extensive experiments show that our attack can effectively induce video-based LLMs to generate incorrect answers when videos are added with imperceptible adversarial perturbations. Intriguingly, our FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate. Overall, our observations inspire a further understanding of multi-modal robustness and safety-related feature alignment across different modalities, which is of great importance for various large multi-modal models. Our code is available at https://github.com/THU-Kingmin/FMM-Attack.
☆ VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the necessary visual change-over-time implied in the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis based on the original single prompt leveraging LLMs, which gives accurate textual guidance to different visual states of longer videos, and 2) Temporal Attention Regularization (TAR) - a regularization technique to refine the temporal attention units of the pre-trained T2V diffusion models, which enables control over the video dynamics. We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models. We additionally analyze the temporal attention maps realized with and without VSTAR, demonstrating the importance of applying our method to mitigate neglect of the desired visual change over time.
comment: Project page: https://yumengli007.github.io/VSTAR
☆ Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such image captioning and visual question answering when coupled with pre-trained vision backbones. While different approaches have been explored to interface LLMs with ``perceptual backbones'' that process, e.g., visual or audio data, they are often explored for different tasks, different datasets, and using different perceptual backbones and language models, hindering direct comparison of the interfacing mechanisms. To remedy this lack of comparability between methods, we present an extensive experimental evaluation of different interfacing mechanisms, across multiple tasks (including image, video, and audio captioning as well as visual question answering), datasets and backbones, paying special attention to low-data settings. We find improved performance using existing mechanisms over state-of-the-art results, and identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a 4x reduction in training time.
☆ A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels
Cross-modal retrieval (CMR) aims to establish interaction between different modalities, among which supervised CMR is emerging due to its flexibility in learning semantic category discrimination. Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to the well-annotated data. However, even for unimodal data, precise annotation is expensive and time-consuming, and it becomes more challenging with the multimodal scenario. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels would bring two key challenges -- enforcing the multimodal samples to \emph{align incorrect semantics} and \emph{widen the heterogeneous gap}, resulting in poor retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide precise transport cost. Second, to narrow the discrepancy in multi-modal data, an OT-based relation alignment is proposed to infer the semantic-level cross-modal matching. Both of these two components leverage the inherent correlation among multi-modal data to facilitate effective cost function. The experiments on three widely-used cross-modal retrieval datasets demonstrate that our UOT-RCL surpasses the state-of-the-art approaches and significantly improves the robustness against noisy labels.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Deepfake Detection without Deepfakes: Generalization via Synthetic Frequency Patterns Injection
Deepfake detectors are typically trained on large sets of pristine and generated images, resulting in limited generalization capacity; they excel at identifying deepfakes created through methods encountered during training but struggle with those generated by unknown techniques. This paper introduces a learning approach aimed at significantly enhancing the generalization capabilities of deepfake detectors. Our method takes inspiration from the unique "fingerprints" that image generation processes consistently introduce into the frequency domain. These fingerprints manifest as structured and distinctly recognizable frequency patterns. We propose to train detectors using only pristine images injecting in part of them crafted frequency patterns, simulating the effects of various deepfake generation techniques without being specific to any. These synthetic patterns are based on generic shapes, grids, or auras. We evaluated our approach using diverse architectures across 25 different generation methods. The models trained with our approach were able to perform state-of-the-art deepfake detection, demonstrating also superior generalization capabilities in comparison with previous methods. Indeed, they are untied to any specific generation technique and can effectively identify deepfakes regardless of how they were made.
☆ Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion
Computer vision techniques play a central role in the perception stack of autonomous vehicles. Such methods are employed to perceive the vehicle surroundings given sensor data. 3D LiDAR sensors are commonly used to collect sparse 3D point clouds from the scene. However, compared to human perception, such systems struggle to deduce the unseen parts of the scene given those sparse point clouds. In this matter, the scene completion task aims at predicting the gaps in the LiDAR measurements to achieve a more complete scene representation. Given the promising results of recent diffusion models as generative models for images, we propose extending them to achieve scene completion from a single 3D LiDAR scan. Previous works used diffusion models over range images extracted from LiDAR data, directly applying image-based diffusion methods. Distinctly, we propose to directly operate on the points, reformulating the noising and denoising diffusion process such that it can efficiently work at scene scale. Together with our approach, we propose a regularization loss to stabilize the noise predicted during the denoising process. Our experimental evaluation shows that our method can complete the scene given a single LiDAR scan as input, producing a scene with more details compared to state-of-the-art scene completion methods. We believe that our proposed diffusion process formulation can support further research in diffusion models applied to scene-scale point cloud data.
☆ Progressive trajectory matching for medical dataset distillation
It is essential but challenging to share medical image datasets due to privacy issues, which prohibit building foundation models and knowledge transfer. In this paper, we propose a novel dataset distillation method to condense the original medical image datasets into a synthetic one that preserves useful information for building an analysis model without accessing the original datasets. Existing methods tackle only natural images by randomly matching parts of the training trajectories of the model parameters trained by the whole real datasets. However, through extensive experiments on medical image datasets, the training process is extremely unstable and achieves inferior distillation results. To solve these barriers, we propose to design a novel progressive trajectory matching strategy to improve the training stability for medical image dataset distillation. Additionally, it is observed that improved stability prevents the synthetic dataset diversity and final performance improvements. Therefore, we propose a dynamic overlap mitigation module that improves the synthetic dataset diversity by dynamically eliminating the overlap across different images and retraining parts of the synthetic images for better convergence. Finally, we propose a new medical image dataset distillation benchmark of various modalities and configurations to promote fair evaluations. It is validated that our proposed method achieves 8.33% improvement over previous state-of-the-art methods on average, and 11.7% improvement when ipc=2 (i.e., image per class is 2). Codes and benchmarks will be released.
☆ CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models
This paper introduces CLIPSwarm, a new algorithm designed to automate the modeling of swarm drone formations based on natural language. The algorithm begins by enriching a provided word, to compose a text prompt that serves as input to an iterative approach to find the formation that best matches the provided word. The algorithm iteratively refines formations of robots to align with the textual description, employing different steps for "exploration" and "exploitation". Our framework is currently evaluated on simple formation targets, limited to contour shapes. A formation is visually represented through alpha-shape contours and the most representative color is automatically found for the input word. To measure the similarity between the description and the visual representation of the formation, we use CLIP [1], encoding text and images into vectors and assessing their similarity. Subsequently, the algorithm rearranges the formation to visually represent the word more effectively, within the given constraints of available drones. Control actions are then assigned to the drones, ensuring robotic behavior and collision-free movement. Experimental results demonstrate the system's efficacy in accurately modeling robot formations from natural language descriptions. The algorithm's versatility is showcased through the execution of drone shows in photorealistic simulation with varying shapes. We refer the reader to the supplementary video for a visual reference of the results.
☆ An AI-Assisted Skincare Routine Recommendation System in XR
In recent years, there has been an increasing interest in the use of artificial intelligence (AI) and extended reality (XR) in the beauty industry. In this paper, we present an AI-assisted skin care recommendation system integrated into an XR platform. The system uses a convolutional neural network (CNN) to analyse an individual's skin type and recommend personalised skin care products in an immersive and interactive manner. Our methodology involves collecting data from individuals through a questionnaire and conducting skin analysis using a provided facial image in an immersive environment. This data is then used to train the CNN model, which recognises the skin type and existing issues and allows the recommendation engine to suggest personalised skin care products. We evaluate our system in terms of the accuracy of the CNN model, which achieves an average score of 93% in correctly classifying existing skin issues. Being integrated into an XR system, this approach has the potential to significantly enhance the beauty industry by providing immersive and engaging experiences to users, leading to more efficient and consistent skincare routines.
☆ HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, \emph{e.g.}, LLaVA, transforms visual features into text-like tokens using a \emph{static} vision-language mapper, thereby enabling \emph{static} LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the \emph{static} tuning strategy~\footnote{The static tuning refers to the trained model with static parameters.} that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generates adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. ~\footnote{Our project is available on the link https://github.com/DCDmllm/HyperLLaVA}.
☆ MedCycle: Unpaired Medical Report Generation via Cycle-Consistency
Generating medical reports for X-ray images presents a significant challenge, particularly in unpaired scenarios where access to paired image-report data for training is unavailable. Previous works have typically learned a joint embedding space for images and reports, necessitating a specific labeling schema for both. We introduce an innovative approach that eliminates the need for consistent labeling schemas, thereby enhancing data accessibility and enabling the use of incompatible datasets. This approach is based on cycle-consistent mapping functions that transform image embeddings into report embeddings, coupled with report auto-encoding for medical report generation. Our model and objectives consider intricate local details and the overarching semantic context within images and reports. This approach facilitates the learning of effective mapping functions, resulting in the generation of coherent reports. It outperforms state-of-the-art results in unpaired chest X-ray report generation, demonstrating improvements in both language and clinical metrics.
☆ Fast-Poly: A Fast Polyhedral Framework For 3D Multi-Object Tracking
3D Multi-Object Tracking (MOT) captures stable and comprehensive motion states of surrounding obstacles, essential for robotic perception. However, current 3D trackers face issues with accuracy and latency consistency. In this paper, we propose Fast-Poly, a fast and effective filter-based method for 3D MOT. Building upon our previous work Poly-MOT, Fast-Poly addresses object rotational anisotropy in 3D space, enhances local computation densification, and leverages parallelization technique, improving inference speed and precision. Fast-Poly is extensively tested on two large-scale tracking benchmarks with Python implementation. On the nuScenes dataset, Fast-Poly achieves new state-of-the-art performance with 75.8% AMOTA among all methods and can run at 34.2 FPS on a personal CPU. On the Waymo dataset, Fast-Poly exhibits competitive accuracy with 63.6% MOTA and impressive inference speed (35.5 FPS). The source code is publicly available at https://github.com/lixiaoyu2000/FastPoly.
comment: 1st on the NuScenes Tracking benchmark with 75.8 AMOTA and 34.2 FPS
☆ Stochastic Geometry Models for Texture Synthesis of Machined Metallic Surfaces: Sandblasting and Milling
Training defect detection algorithms for visual surface inspection systems requires a large and representative set of training data. Often there is not enough real data available which additionally cannot cover the variety of possible defects. Synthetic data generated by a synthetic visual surface inspection environment can overcome this problem. Therefore, a digital twin of the object is needed, whose micro-scale surface topography is modeled by texture synthesis models. We develop stochastic texture models for sandblasted and milled surfaces based on topography measurements of such surfaces. As the surface patterns differ significantly, we use separate modeling approaches for the two cases. Sandblasted surfaces are modeled by a combination of data-based texture synthesis methods that rely entirely on the measurements. In contrast, the model for milled surfaces is procedural and includes all process-related parameters known from the machine settings.
☆ Advancing 6D Pose Estimation in Augmented Reality -- Overcoming Projection Ambiguity with Uncontrolled Imagery
This study addresses the challenge of accurate 6D pose estimation in Augmented Reality (AR), a critical component for seamlessly integrating virtual objects into real-world environments. Our research primarily addresses the difficulty of estimating 6D poses from uncontrolled RGB images, a common scenario in AR applications, which lacks metadata such as focal length. We propose a novel approach that strategically decomposes the estimation of z-axis translation and focal length, leveraging the neural-render and compare strategy inherent in the FocalPose architecture. This methodology not only streamlines the 6D pose estimation process but also significantly enhances the accuracy of 3D object overlaying in AR settings. Our experimental results demonstrate a marked improvement in 6D pose estimation accuracy, with promising applications in manufacturing and robotics. Here, the precise overlay of AR visualizations and the advancement of robotic vision systems stand to benefit substantially from our findings.
☆ MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks. Pretraining is an active research topic, encompassing supervised and self-supervised learning methods to initialize model weights effectively. However, transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks. In this study, we explore the Multi-Task Pretraining (MTP) paradigm for RS foundation models to address this issue. Using a shared encoder and task-specific decoder architecture, we conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection. MTP supports both convolutional neural networks and vision transformer foundation models with over 300 million parameters. The pretrained models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection. Extensive experiments across 14 datasets demonstrate the superiority of our models over existing ones of similar size and their competitive performance compared to larger state-of-the-art models, thus validating the effectiveness of MTP.
comment: The codes and pretrained models will be released at https://github.com/ViTAE-Transformer/MTP
☆ Diversified and Personalized Multi-rater Medical Image Segmentation CVPR 2024
Annotation ambiguity due to inherent data uncertainties such as blurred boundaries in medical scans and different observer expertise and preferences has become a major obstacle for training deep-learning based medical image segmentation models. To address it, the common practice is to gather multiple annotations from different experts, leading to the setting of multi-rater medical image segmentation. Existing works aim to either merge different annotations into the "groundtruth" that is often unattainable in numerous medical contexts, or generate diverse results, or produce personalized results corresponding to individual expert raters. Here, we bring up a more ambitious goal for multi-rater medical image segmentation, i.e., obtaining both diversified and personalized results. Specifically, we propose a two-stage framework named D-Persona (first Diversification and then Personalization). In Stage I, we exploit multiple given annotations to train a Probabilistic U-Net model, with a bound-constrained loss to improve the prediction diversity. In this way, a common latent space is constructed in Stage I, where different latent codes denote diversified expert opinions. Then, in Stage II, we design multiple attention-based projection heads to adaptively query the corresponding expert prompts from the shared latent space, and then perform the personalized medical image segmentation. We evaluated the proposed model on our in-house Nasopharyngeal Carcinoma dataset and the public lung nodule dataset (i.e., LIDC-IDRI). Extensive experiments demonstrated our D-Persona can provide diversified and personalized results at the same time, achieving new SOTA performance for multi-rater medical image segmentation. Our code will be released at https://github.com/ycwu1997/D-Persona.
comment: Accepted by CVPR 2024
☆ Cell Tracking in C. elegans with Cell Position Heatmap-Based Alignment and Pairwise Detection
3D cell tracking in a living organism has a crucial role in live cell image analysis. Cell tracking in C. elegans has two difficulties. First, cell migration in a consecutive frame is large since they move their head during scanning. Second, cell detection is often inconsistent in consecutive frames due to touching cells and low-contrast images, and these inconsistent detections affect the tracking performance worse. In this paper, we propose a cell tracking method to address these issues, which has two main contributions. First, we introduce cell position heatmap-based non-rigid alignment with test-time fine-tuning, which can warp the detected points to near the positions at the next frame. Second, we propose a pairwise detection method, which uses the information of detection results at the previous frame for detecting cells at the current frame. The experimental results demonstrate the effectiveness of each module, and the proposed method achieved the best performance in comparison.
comment: 4 pages, 5 figures, Accepted in EMBC 2023
☆ S2DM: Sector-Shaped Diffusion Models for Video Generation
Diffusion models have achieved great success in image generation. However, when leveraging this idea for video generation, we face significant challenges in maintaining the consistency and continuity across video frames. This is mainly caused by the lack of an effective framework to align frames of videos with desired temporal features while preserving consistent semantic and stochastic features. In this work, we propose a novel Sector-Shaped Diffusion Model (S2DM) whose sector-shaped diffusion region is formed by a set of ray-shaped reverse diffusion processes starting at the same noise point. S2DM can generate a group of intrinsically related data sharing the same semantic and stochastic features while varying on temporal features with appropriate guided conditions. We apply S2DM to video generation tasks, and explore the use of optical flow as temporal conditions. Our experimental results show that S2DM outperforms many existing methods in the task of video generation without any temporal-feature modelling modules. For text-to-video generation tasks where temporal conditions are not explicitly given, we propose a two-stage generation strategy which can decouple the generation of temporal features from semantic-content features. We show that, without additional training, our model integrated with another temporal conditions generative model can still achieve comparable performance with existing works. Our results can be viewd at https://s2dm.github.io/S2DM/.
comment: 17 pages, 6 figures
☆ DOR3D-Net: Dense Ordinal Regression Network for 3D Hand Pose Estimation
Depth-based 3D hand pose estimation is an important but challenging research task in human-machine interaction community. Recently, dense regression methods have attracted increasing attention in 3D hand pose estimation task, which provide a low computational burden and high accuracy regression way by densely regressing hand joint offset maps. However, large-scale regression offset values are often affected by noise and outliers, leading to a significant drop in accuracy. To tackle this, we re-formulate 3D hand pose estimation as a dense ordinal regression problem and propose a novel Dense Ordinal Regression 3D Pose Network (DOR3D-Net). Specifically, we first decompose offset value regression into sub-tasks of binary classifications with ordinal constraints. Then, each binary classifier can predict the probability of a binary spatial relationship relative to joint, which is easier to train and yield much lower level of noise. The estimated hand joint positions are inferred by aggregating the ordinal regression results at local positions with a weighted sum. Furthermore, both joint regression loss and ordinal regression loss are used to train our DOR3D-Net in an end-to-end manner. Extensive experiments on public datasets (ICVL, MSRA, NYU and HANDS2017) show that our design provides significant improvements over SOTA methods.
☆ Unifying Local and Global Multimodal Features for Place Recognition in Aliased and Low-Texture Environments ICRA
Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems. This paper presents a novel model, called UMF (standing for Unifying Local and Global Multimodal Features) that 1) leverages multi-modality by cross-attention blocks between vision and LiDAR features, and 2) includes a re-ranking stage that re-orders based on local feature matching the top-k candidates retrieved using a global representation. Our experiments, particularly on sequences captured on a planetary-analogous environment, show that UMF outperforms significantly previous baselines in those challenging aliased environments. Since our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset, for broader applicability. Code and models are available at https://github.com/DLR-RM/UMF
comment: Accepted submission to International Conference on Robotics and Automation (ICRA), 2024
☆ Robust image segmentation model based on binary level set SC
In order to improve the robustness of traditional image segmentation models to noise, this paper models the illumination term in intensity inhomogeneity images. Additionally, to enhance the model's robustness to noisy images, we incorporate the binary level set model into the proposed model. Compared to the traditional level set, the binary level set eliminates the need for continuous reinitialization. Moreover, by introducing the variational operator GL, our model demonstrates better capability in segmenting noisy images. Finally, we employ the three-step splitting operator method for solving, and the effectiveness of the proposed model is demonstrated on various images.
comment: SCI
☆ IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis
Semantic image synthesis aims to generate high-quality images given semantic conditions, i.e. segmentation masks and style reference images. Existing methods widely adopt generative adversarial networks (GANs). GANs take all conditional inputs and directly synthesize images in a single forward step. In this paper, semantic image synthesis is treated as an image denoising task and is handled with a novel image-to-image diffusion model (IIDM). Specifically, the style reference is first contaminated with random noise and then progressively denoised by IIDM, guided by segmentation masks. Moreover, three techniques, refinement, color-transfer and model ensembles, are proposed to further boost the generation quality. They are plug-in inference modules and do not require additional training. Extensive experiments show that our IIDM outperforms existing state-of-the-art methods by clear margins. Further analysis is provided via detailed demonstrations. We have implemented IIDM based on the Jittor framework; code is available at https://github.com/ader47/jittor-jieke-semantic_images_synthesis.
comment: 6 pages, 7 figures, accetped by CVMJ 2024
☆ Correlation Clustering of Organoid Images
In biological and medical research, scientists now routinely acquire microscopy images of hundreds of morphologically heterogeneous organoids and are then faced with the task of finding patterns in the image collection, i.e., subsets of organoids that appear similar and potentially represent the same morphological class. We adopt models and algorithms for correlating organoid images, i.e., for quantifying the similarity in appearance and geometry of the organoids they depict, and for clustering organoid images by consolidating conflicting correlations. For correlating organoid images, we adopt and compare two alternatives, a partial quadratic assignment problem and a twin network. For clustering organoid images, we employ the correlation clustering problem. Empirically, we learn the parameters of these models, infer a clustering of organoid images, and quantify the accuracy of the inferred clusters, with respect to a training set and a test set we contribute of state-of-the-art light microscopy images of organoids clustered manually by biologists.
comment: 19 pages
☆ Few-shot Oriented Object Detection with Memorable Contrastive Learning in Remote Sensing Images
Few-shot object detection (FSOD) has garnered significant research attention in the field of remote sensing due to its ability to reduce the dependency on large amounts of annotated data. However, two challenges persist in this area: (1) axis-aligned proposals, which can result in misalignment for arbitrarily oriented objects, and (2) the scarcity of annotated data still limits the performance for unseen object categories. To address these issues, we propose a novel FSOD method for remote sensing images called Few-shot Oriented object detection with Memorable Contrastive learning (FOMC). Specifically, we employ oriented bounding boxes instead of traditional horizontal bounding boxes to learn a better feature representation for arbitrary-oriented aerial objects, leading to enhanced detection performance. To the best of our knowledge, we are the first to address oriented object detection in the few-shot setting for remote sensing images. To address the challenging issue of object misclassification, we introduce a supervised contrastive learning module with a dynamically updated memory bank. This module enables the use of large batches of negative samples and enhances the model's capability to learn discriminative features for unseen classes. We conduct comprehensive experiments on the DOTA and HRSC2016 datasets, and our model achieves state-of-the-art performance on the few-shot oriented object detection task. Code and pretrained models will be released.
comment: 13 pages, 8 tables, 10 figures
☆ Counting Network for Learning from Majority Label ICASSP 2024
The paper proposes a novel problem in multi-class Multiple-Instance Learning (MIL) called Learning from the Majority Label (LML). In LML, the majority class of instances in a bag is assigned as the bag's label. LML aims to classify instances using bag-level majority classes. This problem is valuable in various applications. Existing MIL methods are unsuitable for LML due to aggregating confidences, which may lead to inconsistency between the bag-level label and the label obtained by counting the number of instances for each class. This may lead to incorrect instance-level classification. We propose a counting network trained to produce the bag-level majority labels estimated by counting the number of instances for each class. This led to the consistency of the majority class between the network outputs and one obtained by counting the number of instances. Experimental results show that our counting network outperforms conventional MIL methods on four datasets The code is publicly available at https://github.com/Shiku-Kaito/Counting-Network-for-Learning-from-Majority-Label.
comment: 5 pages, 4 figures, Accepted in ICASSP 2024
☆ ManiPose: A Comprehensive Benchmark for Pose-aware Object Manipulation in Robotics IROS 2024
Robotic manipulation in everyday scenarios, especially in unstructured environments, requires skills in pose-aware object manipulation (POM), which adapts robots' grasping and handling according to an object's 6D pose. Recognizing an object's position and orientation is crucial for effective manipulation. For example, if a mug is lying on its side, it's more effective to grasp it by the rim rather than the handle. Despite its importance, research in POM skills remains limited, because learning manipulation skills requires pose-varying simulation environments and datasets. This paper introduces ManiPose, a pioneering benchmark designed to advance the study of pose-varying manipulation tasks. ManiPose encompasses: 1) Simulation environments for POM feature tasks ranging from 6D pose-specific pick-and-place of single objects to cluttered scenes, further including interactions with articulated objects. 2) A comprehensive dataset featuring geometrically consistent and manipulation-oriented 6D pose labels for 2936 real-world scanned rigid objects and 100 articulated objects across 59 categories. 3) A baseline for POM, leveraging the inferencing abilities of LLM (e.g., ChatGPT) to analyze the relationship between 6D pose and task-specific requirements, offers enhanced pose-aware grasp prediction and motion planning capabilities. Our benchmark demonstrates notable advancements in pose estimation, pose-aware manipulation, and real-robot skill transfer, setting new standards for POM research. We will open-source the ManiPose benchmark with the final version paper, inviting the community to engage with our resources, available at our website:https://sites.google.com/view/manipose.
comment: 8 pages, 7 figures, submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
☆ AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite their progress, challenges remain in both prompt-following ability, image quality and lack of high-quality datasets, which are essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync utilizes Vision-Language Models (VLM) to assess image quality across style, coherence, and aesthetics, generating feedback data within an AI-driven loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and SDXL, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2 benchmark, consistently outperforming the base models. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques.
☆ OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning
Redundancy is a persistent challenge in Capsule Networks (CapsNet),leading to high computational costs and parameter counts. Although previous works have introduced pruning after the initial capsule layer, dynamic routing's fully connected nature and non-orthogonal weight matrices reintroduce redundancy in deeper layers. Besides, dynamic routing requires iterating to converge, further increasing computational demands. In this paper, we propose an Orthogonal Capsule Network (OrthCaps) to reduce redundancy, improve routing performance and decrease parameter counts. Firstly, an efficient pruned capsule layer is introduced to discard redundant capsules. Secondly, dynamic routing is replaced with orthogonal sparse attention routing, eliminating the need for iterations and fully connected structures. Lastly, weight matrices during routing are orthogonalized to sustain low capsule similarity, which is the first approach to introduce orthogonality into CapsNet as far as we know. Our experiments on baseline datasets affirm the efficiency and robustness of OrthCaps in classification tasks, in which ablation studies validate the criticality of each component. Remarkably, OrthCaps-Shallow outperforms other Capsule Network benchmarks on four datasets, utilizing only 110k parameters, which is a mere 1.25% of a standard Capsule Network's total. To the best of our knowledge, it achieves the smallest parameter count among existing Capsule Networks. Similarly, OrthCaps-Deep demonstrates competitive performance across four datasets, utilizing only 1.2% of the parameters required by its counterparts.
comment: 8 pages
☆ Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection
Unified anomaly detection (AD) is one of the most challenges for anomaly detection, where one unified model is trained with normal samples from multiple classes with the objective to detect anomalies in these classes. For such a challenging task, popular normalizing flow (NF) based AD methods may fall into a "homogeneous mapping" issue,where the NF-based AD models are biased to generate similar latent representations for both normal and abnormal features, and thereby lead to a high missing rate of anomalies. In this paper, we propose a novel Hierarchical Gaussian mixture normalizing flow modeling method for accomplishing unified Anomaly Detection, which we call HGAD. Our HGAD consists of two key components: inter-class Gaussian mixture modeling and intra-class mixed class centers learning. Compared to the previous NF-based AD methods, the hierarchical Gaussian mixture modeling approach can bring stronger representation capability to the latent space of normalizing flows, so that even complex multi-class distribution can be well represented and learned in the latent space. In this way, we can avoid mapping different class distributions into the same single Gaussian prior, thus effectively avoiding or mitigating the "homogeneous mapping" issue. We further indicate that the more distinguishable different class centers, the more conducive to avoiding the bias issue. Thus, we further propose a mutual information maximization loss for better structuring the latent feature space. We evaluate our method on four real-world AD benchmarks, where we can significantly improve the previous NF-based AD methods and also outperform the SOTA unified AD methods.
comment: 15 pages
☆ vid-TLDR: Training Free Token merging for Light-weight Video Transformer CVPR
Video Transformers have become the prevalent solution for various video downstream tasks with superior expressive power and flexibility. However, these video transformers suffer from heavy computational costs induced by the massive number of tokens across the entire video frames, which has been the major barrier to training the model. Further, the patches irrelevant to the main contents, e.g., backgrounds, degrade the generalization performance of models. To tackle these issues, we propose training free token merging for lightweight video Transformer (vid-TLDR) that aims to enhance the efficiency of video Transformers by merging the background tokens without additional training. For vid-TLDR, we introduce a novel approach to capture the salient regions in videos only with the attention map. Further, we introduce the saliency-aware token merging strategy by dropping the background tokens and sharpening the object scores. Our experiments show that vid-TLDR significantly mitigates the computational complexity of video Transformers while achieving competitive performance compared to the base model without vid-TLDR. Code is available at https://github.com/mlvlab/vid-TLDR.
comment: Conference on Computer Vision and Pattern Recognition (CVPR), 2024
☆ TiBiX: Leveraging Temporal Information for Bidirectional X-ray and Report Generation
With the emergence of vision language models in the medical imaging domain, numerous studies have focused on two dominant research activities: (1) report generation from Chest X-rays (CXR), and (2) synthetic scan generation from text or reports. Despite some research incorporating multi-view CXRs into the generative process, prior patient scans and reports have been generally disregarded. This can inadvertently lead to the leaving out of important medical information, thus affecting generation quality. To address this, we propose TiBiX: Leveraging Temporal information for Bidirectional X-ray and Report Generation. Considering previous scans, our approach facilitates bidirectional generation, primarily addressing two challenging problems: (1) generating the current image from the previous image and current report and (2) generating the current report based on both the previous and current images. Moreover, we extract and release a curated temporal benchmark dataset derived from the MIMIC-CXR dataset, which focuses on temporal data. Our comprehensive experiments and ablation studies explore the merits of incorporating prior CXRs and achieve state-of-the-art (SOTA) results on the report generation task. Furthermore, we attain on-par performance with SOTA image generation efforts, thus serving as a new baseline in longitudinal bidirectional CXR-to-report generation. The code is available at https://github.com/BioMedIA-MBZUAI/TiBiX.
☆ FissionFusion: Fast Geometric Generation and Hierarchical Souping for Medical Image Analysis
The scarcity of well-annotated medical datasets requires leveraging transfer learning from broader datasets like ImageNet or pre-trained models like CLIP. Model soups averages multiple fine-tuned models aiming to improve performance on In-Domain (ID) tasks and enhance robustness against Out-of-Distribution (OOD) datasets. However, applying these methods to the medical imaging domain faces challenges and results in suboptimal performance. This is primarily due to differences in error surface characteristics that stem from data complexities such as heterogeneity, domain shift, class imbalance, and distributional shifts between training and testing phases. To address this issue, we propose a hierarchical merging approach that involves local and global aggregation of models at various levels based on models' hyperparameter configurations. Furthermore, to alleviate the need for training a large number of models in the hyperparameter search, we introduce a computationally efficient method using a cyclical learning rate scheduler to produce multiple models for aggregation in the weight space. Our method demonstrates significant improvements over the model souping approach across multiple datasets (around 6% gain in HAM10000 and CheXpert datasets) while maintaining low computational costs for model generation and selection. Moreover, we achieve better results on OOD datasets than model soups. The code is available at https://github.com/BioMedIA-MBZUAI/FissionFusion.
☆ Adaptive Critical Subgraph Mining for Cognitive Impairment Conversion Prediction with T1-MRI-based Brain Network
Prediction the conversion to early-stage dementia is critical for mitigating its progression but remains challenging due to subtle cognitive impairments and structural brain changes. Traditional T1-weighted magnetic resonance imaging (T1-MRI) research focus on identifying brain atrophy regions but often fails to address the intricate connectivity between them. This limitation underscores the necessity of focuing on inter-regional connectivity for a comprehensive understand of the brain's complex network. Moreover, there is a pressing demand for methods that adaptively preserve and extract critical information, particularly specialized subgraph mining techniques for brain networks. These are essential for developing high-quality feature representations that reveal critical spatial impacts of structural brain changes and its topology. In this paper, we propose Brain-SubGNN, a novel graph representation network to mine and enhance critical subgraphs based on T1-MRI. This network provides a subgraph-level interpretation, enhancing interpretability and insights for graph analysis. The process begins by extracting node features and a correlation matrix between nodes to construct a task-oriented brain network. Brain-SubGNN then adaptively identifies and enhances critical subgraphs, capturing both loop and neighbor subgraphs. This method reflects the loop topology and local changes, indicative of long-range connections, and maintains local and global brain attributes. Extensive experiments validate the effectiveness and advantages of Brain-SubGNN, demonstrating its potential as a powerful tool for understanding and diagnosing early-stage dementia. Source code is available at https://github.com/Leng-10/Brain-SubGNN.
comment: 11 pages
☆ Learning Novel View Synthesis from Heterogeneous Low-light Captures
Neural radiance field has achieved fundamental success in novel view synthesis from input views with the same brightness level captured under fixed normal lighting. Unfortunately, synthesizing novel views remains to be a challenge for input views with heterogeneous brightness level captured under low-light condition. The condition is pretty common in the real world. It causes low-contrast images where details are concealed in the darkness and camera sensor noise significantly degrades the image quality. To tackle this problem, we propose to learn to decompose illumination, reflectance, and noise from input views according to that reflectance remains invariant across heterogeneous views. To cope with heterogeneous brightness and noise levels across multi-views, we learn an illumination embedding and optimize a noise map individually for each view. To allow intuitive editing of the illumination, we design an illumination adjustment module to enable either brightening or darkening of the illumination component. Comprehensive experiments demonstrate that this approach enables effective intrinsic decomposition for low-light multi-view noisy images and achieves superior visual quality and numerical performance for synthesizing novel views compared to state-of-the-art methods.
☆ AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving
As an essential task in autonomous driving (AD), motion prediction aims to predict the future states of surround objects for navigation. One natural solution is to estimate the position of other agents in a step-by-step manner where each predicted time-step is conditioned on both observed time-steps and previously predicted time-steps, i.e., autoregressive prediction. Pioneering works like SocialLSTM and MFP design their decoders based on this intuition. However, almost all state-of-the-art works assume that all predicted time-steps are independent conditioned on observed time-steps, where they use a single linear layer to generate positions of all time-steps simultaneously. They dominate most motion prediction leaderboards due to the simplicity of training MLPs compared to autoregressive networks. In this paper, we introduce the GPT style next token prediction into motion forecasting. In this way, the input and output could be represented in a unified space and thus the autoregressive prediction becomes more feasible. However, different from language data which is composed of homogeneous units -words, the elements in the driving scene could have complex spatial-temporal and semantic relations. To this end, we propose to adopt three factorized attention modules with different neighbors for information aggregation and different position encoding styles to capture their relations, e.g., encoding the transformation between coordinate systems for spatial relativity while adopting RoPE for temporal relativity. Empirically, by equipping with the aforementioned tailored designs, the proposed method achieves state-of-the-art performance in the Waymo Open Motion and Waymo Interaction datasets. Notably, AMP outperforms other recent autoregressive motion prediction methods: MotionLM and StateTransformer, which demonstrates the effectiveness of the proposed designs.
☆ Efficient scene text image super-resolution with semantic guidance
Scene text image super-resolution has significantly improved the accuracy of scene text recognition. However, many existing methods emphasize performance over efficiency and ignore the practical need for lightweight solutions in deployment scenarios. Faced with the issues, our work proposes an efficient framework called SGENet to facilitate deployment on resource-limited platforms. SGENet contains two branches: super-resolution branch and semantic guidance branch. We apply a lightweight pre-trained recognizer as a semantic extractor to enhance the understanding of text information. Meanwhile, we design the visual-semantic alignment module to achieve bidirectional alignment between image features and semantics, resulting in the generation of highquality prior guidance. We conduct extensive experiments on benchmark dataset, and the proposed SGENet achieves excellent performance with fewer computational costs. Code is available at https://github.com/SijieLiu518/SGENet
☆ Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion
High-quality scene reconstruction and novel view synthesis based on Gaussian Splatting (3DGS) typically require steady, high-quality photographs, often impractical to capture with handheld cameras. We present a method that adapts to camera motion and allows high-quality scene reconstruction with handheld video data suffering from motion blur and rolling shutter distortion. Our approach is based on detailed modelling of the physical image formation process and utilizes velocities estimated using visual-inertial odometry (VIO). Camera poses are considered non-static during the exposure time of a single image frame and camera poses are further optimized in the reconstruction process. We formulate a differentiable rendering pipeline that leverages screen space approximation to efficiently incorporate rolling-shutter and motion blur effects into the 3DGS framework. Our results with both synthetic and real data demonstrate superior performance in mitigating camera motion over existing methods, thereby advancing 3DGS in naturalistic settings.
comment: Source code available at https://github.com/SpectacularAI/3dgs-deblur
☆ Out-of-Distribution Detection Using Peer-Class Generated by Large Language Model
Out-of-distribution (OOD) detection is a critical task to ensure the reliability and security of machine learning models deployed in real-world applications. Conventional methods for OOD detection that rely on single-modal information, often struggle to capture the rich variety of OOD instances. The primary difficulty in OOD detection arises when an input image has numerous similarities to a particular class in the in-distribution (ID) dataset, e.g., wolf to dog, causing the model to misclassify it. Nevertheless, it may be easy to distinguish these classes in the semantic domain. To this end, in this paper, a novel method called ODPC is proposed, in which specific prompts to generate OOD peer classes of ID semantics are designed by a large language model as an auxiliary modality to facilitate detection. Moreover, a contrastive loss based on OOD peer classes is devised to learn compact representations of ID classes and improve the clarity of boundaries between different classes. The extensive experiments on five benchmark datasets show that the method we propose can yield state-of-the-art results.
☆ DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation
Dataset distillation is an advanced technique aimed at compressing datasets into significantly smaller counterparts, while preserving formidable training performance. Significant efforts have been devoted to promote evaluation accuracy under limited compression ratio while overlooked the robustness of distilled dataset. In this work, we introduce a comprehensive benchmark that, to the best of our knowledge, is the most extensive to date for evaluating the adversarial robustness of distilled datasets in a unified way. Our benchmark significantly expands upon prior efforts by incorporating a wider range of dataset distillation methods, including the latest advancements such as TESLA and SRe2L, a diverse array of adversarial attack methods, and evaluations across a broader and more extensive collection of datasets such as ImageNet-1K. Moreover, we assessed the robustness of these distilled datasets against representative adversarial attack algorithms like PGD and AutoAttack, while exploring their resilience from a frequency perspective. We also discovered that incorporating distilled data into the training batches of the original dataset can yield to improvement of robustness.
☆ HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling
The integration of diverse clinical modalities such as medical imaging and the tabular data obtained by the patients' Electronic Health Records (EHRs) is a crucial aspect of modern healthcare. The integrative analysis of multiple sources can provide a comprehensive understanding of a patient's condition and can enhance diagnoses and treatment decisions. Deep Neural Networks (DNNs) consistently showcase outstanding performance in a wide range of multimodal tasks in the medical domain. However, the complex endeavor of effectively merging medical imaging with clinical, demographic and genetic information represented as numerical tabular data remains a highly active and ongoing research pursuit. We present a novel framework based on hypernetworks to fuse clinical imaging and tabular data by conditioning the image processing on the EHR's values and measurements. This approach aims to leverage the complementary information present in these modalities to enhance the accuracy of various medical applications. We demonstrate the strength and the generality of our method on two different brain Magnetic Resonance Imaging (MRI) analysis tasks, namely, brain age prediction conditioned by subject's sex, and multiclass Alzheimer's Disease (AD) classification conditioned by tabular data. We show that our framework outperforms both single-modality models and state-of-the-art MRI-tabular data fusion methods. The code, enclosed to this manuscript will be made publicly available.
comment: 17 pages, 8 figures
☆ PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
Large multimodal models extend the impressive capabilities of large language models by integrating multimodal understanding abilities. However, it is not clear how they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of puzzles based on abstract patterns. With this dataset, we evaluate large multimodal models with abstract patterns based on fundamental concepts, including colors, numbers, sizes, and shapes. Through our experiments on state-of-the-art large multimodal models, we find that they are not able to generalize well to simple abstract patterns. Notably, even GPT-4V cannot solve more than half of the puzzles. To diagnose the reasoning challenges in large multimodal models, we progressively guide the models with our ground truth reasoning explanations for visual perception, inductive reasoning, and deductive reasoning. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities. Through this work, we hope to shed light on the limitations of large multimodal models and how they can better emulate human cognitive processes in the future (Our data and code will be released publicly at https://github.com/declare-lab/LLM-PuzzleTest).
☆ LaserHuman: Language-guided Scene-aware Human Motion Generation in Free Environment
Language-guided scene-aware human motion generation has great significance for entertainment and robotics. In response to the limitations of existing datasets, we introduce LaserHuman, a pioneering dataset engineered to revolutionize Scene-Text-to-Motion research. LaserHuman stands out with its inclusion of genuine human motions within 3D environments, unbounded free-form natural language descriptions, a blend of indoor and outdoor scenarios, and dynamic, ever-changing scenes. Diverse modalities of capture data and rich annotations present great opportunities for the research of conditional motion generation, and can also facilitate the development of real-life applications. Moreover, to generate semantically consistent and physically plausible human motions, we propose a multi-conditional diffusion model, which is simple but effective, achieving state-of-the-art performance on existing datasets.
☆ DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception CVPR 2024
Current perceptive models heavily depend on resource-intensive datasets, prompting the need for innovative solutions. Leveraging recent advances in diffusion models, synthetic data, by constructing image inputs from various annotations, proves beneficial for downstream tasks. While prior methods have separately addressed generative and perceptive models, DetDiffusion, for the first time, harmonizes both, tackling the challenges in generating effective data for perceptive models. To enhance image generation with perceptive models, we introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability. To boost the performance of specific perceptive models, our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation. Experimental results from the object detection task highlight DetDiffusion's superior performance, establishing a new state-of-the-art in layout-guided generation. Furthermore, image syntheses from DetDiffusion can effectively augment training data, significantly enhancing downstream detection performance.
comment: Accepted to CVPR 2024
☆ Rotary Position Embedding for Vision Transformer
Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines to apply RoPE into ViT, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit
comment: 20 pages, 5 figures
♻ ☆ AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes
Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text into well-structured and textured indoor scenes at a house-scale. By prompting Large Language Models (LLMs) with designed templates, our approach converts provided textual narratives into amodal structured representations. These representations guarantee consistent and realistic spatial layouts by directing the synthesis of a geometry mesh within defined constraints. A Score Distillation Sampling process is then employed to refine the geometry, followed by an egocentric inpainting process that adds lifelike textures to it. AnyHome stands out with its editability, customizability, diversity, and realism. The structured representations for scenes allow for extensive editing at varying levels of granularity. Capable of interpreting texts ranging from simple labels to detailed narratives, AnyHome generates detailed geometries and textures that outperform existing methods in both quantitative and qualitative measures.
♻ ☆ Magic-Me: Identity-Specific Video Customized Diffusion
Creating content with specified identities (ID) has attracted significant interest in the field of generative models. In the field of text-to-image generation (T2I), subject-driven creation has achieved great progress with the identity controlled via reference images. However, its extension to video generation is not well explored. In this work, we propose a simple yet effective subject identity controllable video generation framework, termed Video Custom Diffusion (VCD). With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation at the initialization stage for stable video outputs. To achieve this, we propose three novel components that are essential for high-quality identity preservation and stable video generation: 1) a noise initialization method with 3D Gaussian Noise Prior for better inter-frame stability; 2) an ID module based on extended Textual Inversion trained with the cropped identity to disentangle the ID information from the background 3) Face VCD and Tiled VCD modules to reinforce faces and upscale the video to higher resolution while preserving the identity's features. We conducted extensive experiments to verify that VCD is able to generate stable videos with better ID over the baselines. Besides, with the transferability of the encoded identity in the ID module, VCD is also working well with personalized text-to-image models available publicly. The codes are available at https://github.com/Zhen-Dong/Magic-Me.
comment: Project Page at https://magic-me-webpage.github.io
♻ ☆ m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and Github (https://github.com/RAIVNLab/mnms).
♻ ☆ TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models
Despite remarkable achievements in video synthesis, achieving granular control over complex dynamics, such as nuanced movement among multiple interacting objects, still presents a significant hurdle for dynamic world modeling, compounded by the necessity to manage appearance and disappearance, drastic scale changes, and ensure consistency for instances across frames. These challenges hinder the development of video generation that can faithfully mimic real-world complexity, limiting utility for applications requiring high-level realism and controllability, including advanced scene simulation and training of perception systems. To address that, we propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control via diffusion models, which facilitates the precise manipulation of the object trajectories and interactions, overcoming the prevalent limitation of scale and continuity disruptions. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects, a critical factor overlooked in the current literature. Moreover, we demonstrate that generated video sequences by our TrackDiffusion can be used as training data for visual perception models. To the best of our knowledge, this is the first work to apply video diffusion models with tracklet conditions and demonstrate that generated frames can be beneficial for improving the performance of object trackers.
♻ ☆ PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
The emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmark impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for Large Multimodal Models (LMMs). It comprises 33,428 multimodal multi-choice questions and 24,067 images from various sources, each accompanied by an explanation for the correct answer. The construction of PathMMU harnesses GPT-4V's advanced capabilities, utilizing over 30,000 image-caption pairs to enrich captions and generate corresponding Q&As in a cascading process. Significantly, to maximize PathMMU's authority, we invite seven pathologists to scrutinize each question under strict standards in PathMMU's validation and test sets, while simultaneously setting an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-sourced and 4 closed-sourced LMMs and their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark, with the top-performing LMM, GPT-4V, achieving only a 49.8% zero-shot performance, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, significantly smaller open-sourced LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope that the PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.
comment: 27 pages, 12 figures
♻ ☆ Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels NeurIPS 2023
Intersection over Union (IoU) losses are surrogates that directly optimize the Jaccard index. Leveraging IoU losses as part of the loss function have demonstrated superior performance in semantic segmentation tasks compared to optimizing pixel-wise losses such as the cross-entropy loss alone. However, we identify a lack of flexibility in these losses to support vital training techniques like label smoothing, knowledge distillation, and semi-supervised learning, mainly due to their inability to process soft labels. To address this, we introduce Jaccard Metric Losses (JMLs), which are identical to the soft Jaccard loss in standard settings with hard labels but are fully compatible with soft labels. We apply JMLs to three prominent use cases of soft labels: label smoothing, knowledge distillation and semi-supervised learning, and demonstrate their potential to enhance model accuracy and calibration. Our experiments show consistent improvements over the cross-entropy loss across 4 semantic segmentation datasets (Cityscapes, PASCAL VOC, ADE20K, DeepGlobe Land) and 13 architectures, including classic CNNs and recent vision transformers. Remarkably, our straightforward approach significantly outperforms state-of-the-art knowledge distillation and semi-supervised learning methods. The code is available at \href{https://github.com/zifuwanggg/JDTLosses}{https://github.com/zifuwanggg/JDTLosses}.
comment: NeurIPS 2023
♻ ☆ Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering
Modeling dynamic, large-scale urban scenes is challenging due to their highly intricate geometric structures and unconstrained dynamics in both space and time. Prior methods often employ high-level architectural priors, separating static and dynamic elements, resulting in suboptimal capture of their synergistic interactions. To address this challenge, we present a unified representation model, called Periodic Vibration Gaussian (PVG). PVG builds upon the efficient 3D Gaussian splatting technique, originally designed for static scene representation, by introducing periodic vibration-based temporal dynamics. This innovation enables PVG to elegantly and uniformly represent the characteristics of various objects and elements in dynamic urban scenes. To enhance temporally coherent and large scene representation learning with sparse training data, we introduce a novel temporal smoothing mechanism and a position-aware adaptive control strategy respectively. Extensive experiments on Waymo Open Dataset and KITTI benchmarks demonstrate that PVG surpasses state-of-the-art alternatives in both reconstruction and novel view synthesis for both dynamic and static scenes. Notably, PVG achieves this without relying on manually labeled object bounding boxes or expensive optical flow estimation. Moreover, PVG exhibits 900-fold acceleration in rendering over the best alternative.
comment: Project page: https://fudan-zvg.github.io/PVG/
♻ ☆ Normalizing flow-based deep variational Bayesian network for seismic multi-hazards and impacts estimation from InSAR imagery
Onsite disasters like earthquakes can trigger cascading hazards and impacts, such as landslides and infrastructure damage, leading to catastrophic losses; thus, rapid and accurate estimates are crucial for timely and effective post-disaster responses. Interferometric Synthetic aperture radar (InSAR) data is important in providing high-resolution onsite information for rapid hazard estimation. Most recent methods using InSAR imagery signals predict a single type of hazard and thus often suffer low accuracy due to noisy and complex signals induced by co-located hazards, impacts, and irrelevant environmental changes (e.g., vegetation changes, human activities). We introduce a novel stochastic variational inference with normalizing flows derived to jointly approximate posteriors of multiple unobserved hazards and impacts from noisy InSAR imagery.
comment: This paper needs to be reviewed by the USGS
♻ ☆ Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer
Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR) by accessing both the source and target data. Considering privacy policies or transmission restrictions of source data in practical scenarios, we propose a SOurce-free Domain Adaptation framework for image SR (SODA-SR) to address this issue, i.e., adapt a source-trained model to a target domain with only unlabeled target data. SODA-SR leverages the source-trained model to generate refined pseudo-labels for teacher-student learning. To better utilize pseudo-labels, we propose a novel wavelet-based augmentation method, named Wavelet Augmentation Transformer (WAT), which can be flexibly incorporated with existing networks, to implicitly produce useful augmented data. WAT learns low-frequency information of varying levels across diverse samples, which is aggregated efficiently via deformable attention. Furthermore, an uncertainty-aware self-training mechanism is proposed to improve the accuracy of pseudo-labels, with inaccurate predictions being rectified by uncertainty estimation. To acquire better SR results and avoid overfitting pseudo-labels, several regularization losses are proposed to constrain target LR and SR images in the frequency domain. Experiments show that without accessing source data, SODA-SR outperforms state-of-the-art UDA methods in both synthetic$\rightarrow$real and real$\rightarrow$real adaptation settings, and is not constrained by specific network architectures.
comment: 11 pages, 7 figures, 3 tables
♻ ☆ Towards Architecture-Agnostic Untrained Network Priors for Image Reconstruction with Frequency Regularization
Untrained networks inspired by deep image prior have shown promising capabilities in recovering a high-quality image from noisy or partial measurements, without requiring training data. Their success has been widely attributed to the spectral bias acting as an implicit regularization induced by suitable network architectures. However, applications of such network-based priors often entail superfluous architectural decisions, overfitting risks, and slow optimization, all of which hinder their practicality. In this work, we propose efficient, architecture-agnostic methods for a more direct frequency control over the network priors: 1) constraining the bandwidth of the white-noise input, 2) controlling the bandwidth of the interpolation-based upsamplers, and 3) regularizing the Lipschitz constants of the layers. We show that even with just one extra line of code, the overfitting issues in underperforming architectures can be alleviated such that their performance gaps with the high-performing counterparts can be largely closed despite their distinct configurations, mitigating the need for architecture tuning. This then makes it possible to employ a more compact model to achieve similar or superior performance to larger models with greater efficiency. Our regularized network priors compare favorably with current supervised and self-supervised methods on MRI reconstruction and image inpainting tasks, serving as a stronger zero-shot baseline reconstructor. Our code will be made publicly available.
♻ ☆ Simple Semantic-Aided Few-Shot Learning CVPR 2024
Learning from a limited amount of data, namely Few-Shot Learning, stands out as a challenging computer vision task. Several works exploit semantics and design complicated semantic fusion mechanisms to compensate for rare representative features within restricted data. However, relying on naive semantics such as class names introduces biases due to their brevity, while acquiring extensive semantics from external knowledge takes a huge time and effort. This limitation severely constrains the potential of semantics in few-shot learning. In this paper, we design an automatic way called Semantic Evolution to generate high-quality semantics. The incorporation of high-quality semantics alleviates the need for complex network structures and learning algorithms used in previous works. Hence, we employ a simple two-layer network termed Semantic Alignment Network to transform semantics and visual features into robust class prototypes with rich discriminative features for few-shot classification. The experimental results show our framework outperforms all previous methods on six benchmarks, demonstrating a simple network with high-quality semantics can beat intricate multi-modal modules on few-shot classification tasks. Code is available at https://github.com/zhangdoudou123/SemFew.
comment: Accepted by CVPR 2024
♻ ☆ MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning ICLR2024
Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address the limitation above by 1) introducing vision-language Model with Multi-Modal In-Context Learning(MMICL), a new approach to allow the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially for complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and emerges the impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue for VLMs that often leads to hallucination when faced with extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC
comment: Accepted by ICLR2024
♻ ☆ Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration
Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across most tasks. Post multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.
comment: 13 pages, 8 figures, 9 tables
♻ ☆ Auto-Vocabulary Semantic Segmentation
Open-ended image understanding tasks gained significant attention from the research community, particularly with the emergence of Vision-Language Models. Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, they operate without the need for training or fine-tuning. However, OVS methods typically require users to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce \textit{Auto-Vocabulary Semantic Segmentation (AVS)}, advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach, \ours, presents a framework that autonomously identifies relevant class names using enhanced BLIP embeddings, which are utilized for segmentation afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated class names and their corresponding segments. Our method sets new benchmarks on datasets such as PASCAL VOC and Context, ADE20K, and Cityscapes for AVS and showcases competitive performance to OVS methods that require specified class names.
♻ ☆ CoNeS: Conditional neural fields with shift modulation for multi-sequence MRI translation
Multi-sequence magnetic resonance imaging (MRI) has found wide applications in both modern clinical studies and deep learning research. However, in clinical practice, it frequently occurs that one or more of the MRI sequences are missing due to different image acquisition protocols or contrast agent contraindications of patients, limiting the utilization of deep learning models trained on multi-sequence data. One promising approach is to leverage generative models to synthesize the missing sequences, which can serve as a surrogate acquisition. State-of-the-art methods tackling this problem are based on convolutional neural networks (CNN) which usually suffer from spectral biases, resulting in poor reconstruction of high-frequency fine details. In this paper, we propose Conditional Neural fields with Shift modulation (CoNeS), a model that takes voxel coordinates as input and learns a representation of the target images for multi-sequence MRI translation. The proposed model uses a multi-layer perceptron (MLP) instead of a CNN as the decoder for pixel-to-pixel mapping. Hence, each target image is represented as a neural field that is conditioned on the source image via shift modulation with a learned latent code. Experiments on BraTS 2018 and an in-house clinical dataset of vestibular schwannoma patients showed that the proposed method outperformed state-of-the-art methods for multi-sequence MRI translation both visually and quantitatively. Moreover, we conducted spectral analysis, showing that CoNeS was able to overcome the spectral bias issue common in conventional CNN models. To further evaluate the usage of synthesized images in clinical downstream tasks, we tested a segmentation network using the synthesized images at inference.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:004
♻ ☆ Dice Semimetric Losses: Optimizing the Dice Score with Soft Labels MICCAI 2023
The soft Dice loss (SDL) has taken a pivotal role in numerous automated segmentation pipelines in the medical imaging community. Over the last years, some reasons behind its superior functioning have been uncovered and further optimizations have been explored. However, there is currently no implementation that supports its direct utilization in scenarios involving soft labels. Hence, a synergy between the use of SDL and research leveraging the use of soft labels, also in the context of model calibration, is still missing. In this work, we introduce Dice semimetric losses (DMLs), which (i) are by design identical to SDL in a standard setting with hard labels, but (ii) can be employed in settings with soft labels. Our experiments on the public QUBIQ, LiTS and KiTS benchmarks confirm the potential synergy of DMLs with soft labels (e.g. averaging, label smoothing, and knowledge distillation) over hard labels (e.g. majority voting and random selection). As a result, we obtain superior Dice scores and model calibration, which supports the wider adoption of DMLs in practice. The code is available at https://github.com/zifuwanggg/JDTLosses
comment: MICCAI 2023
♻ ☆ Poly Kernel Inception Network for Remote Sensing Detection
Object detection in remote sensing images (RSIs) often suffers from several increasing challenges, including the large variation in object scales and the diverse-ranging context. Prior methods tried to address these challenges by expanding the spatial receptive field of the backbone, either through large-kernel convolution or dilated convolution. However, the former typically introduces considerable background noise, while the latter risks generating overly sparse feature representations. In this paper, we introduce the Poly Kernel Inception Network (PKINet) to handle the above challenges. PKINet employs multi-scale convolution kernels without dilation to extract object features of varying scales and capture local context. In addition, a Context Anchor Attention (CAA) module is introduced in parallel to capture long-range contextual information. These two components work jointly to advance the performance of PKINet on four challenging remote sensing detection benchmarks, namely DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R.
comment: accepted by IEEE Conference on Computer Vision and Pattern Recognition, 2024
♻ ☆ Weakly supervised segmentation of intracranial aneurysms using a novel 3D focal modulation UNet
Accurate identification and quantification of unruptured intracranial aneurysms (UIAs) is crucial for the risk assessment and treatment of this cerebrovascular disorder. Current 2D manual assessment on 3D magnetic resonance angiography (MRA) is suboptimal and time-consuming. In addition, one major issue in medical image segmentation is the need for large well-annotated data, which can be expensive to obtain. Techniques that mitigate this requirement, such as weakly supervised learning with coarse labels are highly desirable. In the paper, we propose FocalSegNet, a novel 3D focal modulation UNet, to detect an aneurysm and offer an initial, coarse segmentation of it from time-of-flight MRA image patches, which is further refined with a dense conditional random field (CRF) post-processing layer to produce a final segmentation map. We trained and evaluated our model on a public dataset, and in terms of UIA detection, our model showed a low false-positive rate of 0.21 and a high sensitivity of 0.80. For voxel-wise aneurysm segmentation, we achieved a Dice score of 0.68 and a 95% Hausdorff distance of ~0.95 mm, demonstrating its strong performance. We evaluated our algorithms against the state-of-the-art 3D Residual-UNet and Swin-UNETR, and illustrated the superior performance of our proposed FocalSegNet, highlighting the advantages of employing focal modulation for this task.
♻ ☆ View-Consistent 3D Editing with Gaussian Splatting
The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing, offering efficient, high-fidelity rendering and enabling precise local manipulations. Currently, diffusion-based 2D editing models are harnessed to modify multi-view rendered images, which then guide the editing of 3DGS models. However, this approach faces a critical issue of multi-view inconsistency, where the guidance images exhibit significant discrepancies across views, leading to mode collapse and visual artifacts of 3DGS. To this end, we introduce View-consistent Editing (VcEdit), a novel framework that seamlessly incorporates 3DGS into image editing processes, ensuring multi-view consistency in edited guidance images and effectively mitigating mode collapse issues. VcEdit employs two innovative consistency modules: the Cross-attention Consistency Module and the Editing Consistency Module, both designed to reduce inconsistencies in edited images. By incorporating these consistency modules into an iterative pattern, VcEdit proficiently resolves the issue of multi-view inconsistency, facilitating high-quality 3DGS editing across a diverse range of scenes.
♻ ☆ DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction CVPR2024
In Multiple Object Tracking, objects often exhibit non-linear motion of acceleration and deceleration, with irregular direction changes. Tacking-by-detection (TBD) trackers with Kalman Filter motion prediction work well in pedestrian-dominant scenarios but fall short in complex situations when multiple objects perform non-linear and diverse motion simultaneously. To tackle the complex non-linear motion, we propose a real-time diffusion-based MOT approach named DiffMOT. Specifically, for the motion predictor component, we propose a novel Decoupled Diffusion-based Motion Predictor (D$^2$MP). It models the entire distribution of various motion presented by the data as a whole. It also predicts an individual object's motion conditioning on an individual's historical motion information. Furthermore, it optimizes the diffusion process with much fewer sampling steps. As a MOT tracker, the DiffMOT is real-time at 22.7FPS, and also outperforms the state-of-the-art on DanceTrack and SportsMOT datasets with $62.3\%$ and $76.2\%$ in HOTA metrics, respectively. To the best of our knowledge, DiffMOT is the first to introduce a diffusion probabilistic model into the MOT to tackle non-linear motion prediction.
comment: CVPR2024
♻ ☆ OSCaR: Object State Captioning and State Change Representation NAACL 2024
The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.
comment: NAACL 2024
♻ ☆ On the Privacy Effect of Data Enhancement via the Lens of Memorization
Machine learning poses severe privacy concerns as it has been shown that the learned models can reveal sensitive information about their training data. Many works have investigated the effect of widely adopted data augmentation and adversarial training techniques, termed data enhancement in the paper, on the privacy leakage of machine learning models. Such privacy effects are often measured by membership inference attacks (MIAs), which aim to identify whether a particular example belongs to the training set or not. We propose to investigate privacy from a new perspective called memorization. Through the lens of memorization, we find that previously deployed MIAs produce misleading results as they are less likely to identify samples with higher privacy risks as members compared to samples with low privacy risks. To solve this problem, we deploy a recent attack that can capture individual samples' memorization degrees for evaluation. Through extensive experiments, we unveil several findings about the connections between three essential properties of machine learning models, including privacy, generalization gap, and adversarial robustness. We demonstrate that the generalization gap and privacy leakage are less correlated than those of the previous results. Moreover, there is not necessarily a trade-off between adversarial robustness and privacy as stronger adversarial robustness does not make the model more susceptible to privacy attacks.
comment: Accepted by IEEE TIFS, 17 pages
♻ ☆ Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation
This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition. We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features. Through the utilization of Temporal Convolutional Network (TCN) modules, we effectively captured the temporal and spatial correlations between these features. Subsequently, we employed a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability. Our method leverages a multimodal data fusion approach, integrating pre-trained audio and video backbones for feature extraction, followed by TCN-based spatiotemporal encoding and Transformer-based temporal information capture. Experimental results demonstrate the effectiveness of our approach, achieving competitive performance in VA estimation on the AffWild2 dataset.
comment: 8 pages,3 figures
♻ ☆ Surfer: Progressive Reasoning with World Models for Robotic Manipulation
Considering how to make the model accurately understand and follow natural language instructions and perform actions consistent with world knowledge is a key challenge in robot manipulation. This mainly includes human fuzzy instruction reasoning and the following of physical knowledge. Therefore, the embodied intelligence agent must have the ability to model world knowledge from training data. However, most existing vision and language robot manipulation methods mainly operate in less realistic simulator and language settings and lack explicit modeling of world knowledge. To bridge this gap, we introduce a novel and simple robot manipulation framework, called Surfer. It is based on the world model, treats robot manipulation as a state transfer of the visual scene, and decouples it into two parts: action and scene. Then, the generalization ability of the model on new instructions and new scenes is enhanced by explicit modeling of the action and scene prediction in multi-modal information. In addition to the framework, we also built a robot manipulation simulator that supports full physics execution based on the MuJoCo physics engine. It can automatically generate demonstration training data and test data, effectively reducing labor costs. To conduct a comprehensive and systematic evaluation of the robot manipulation model in terms of language understanding and physical execution, we also created a robotic manipulation benchmark with progressive reasoning tasks, called SeaWave. It contains 4 levels of progressive reasoning tasks and can provide a standardized testing platform for embedded AI agents in multi-modal environments. On average, Surfer achieved a success rate of 54.74% on the defined four levels of manipulation tasks, exceeding the best baseline performance of 47.64%.
♻ ☆ Learning Spatiotemporal Inconsistency via Thumbnail Layout for Face Deepfake Detection
The deepfake threats to society and cybersecurity have provoked significant public apprehension, driving intensified efforts within the realm of deepfake video detection. Current video-level methods are mostly based on {3D CNNs} resulting in high computational demands, although have achieved good performance. This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies. This transformation process involves sequentially masking frames at the same positions within each frame. These frames are then resized into sub-frames and reorganized into the predetermined layout, forming thumbnails. TALL is model-agnostic and has remarkable simplicity, necessitating only minimal code modifications. Furthermore, we introduce a graph reasoning block (GRB) and semantic consistency (SC) loss to strengthen TALL, culminating in TALL++. GRB enhances interactions between different semantic regions to capture semantic-level inconsistency clues. The semantic consistency loss imposes consistency constraints on semantic features to improve model generalization ability. Extensive experiments on intra-dataset, cross-dataset, diffusion-generated image detection, and deepfake generation method recognition show that TALL++ achieves results surpassing or comparable to the state-of-the-art methods, demonstrating the effectiveness of our approaches for various deepfake detection problems. The code is available at https://github.com/rainy-xu/TALL4Deepfake.
comment: Accepted by IJCV
♻ ☆ Vulnerability analysis of captcha using Deep learning
Several websites improve their security and avoid dangerous Internet attacks by implementing CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), a type of verification to identify whether the end-user is human or a robot. The most prevalent type of CAPTCHA is text-based, designed to be easily recognized by humans while being unsolvable towards machines or robots. However, as deep learning technology progresses, development of convolutional neural network (CNN) models that predict text-based CAPTCHAs becomes easier. The purpose of this research is to investigate the flaws and vulnerabilities in the CAPTCHA generating systems in order to design more resilient CAPTCHAs. To achieve this, we created CapNet, a Convolutional Neural Network. The proposed platform can evaluate both numerical and alphanumerical CAPTCHAs
♻ ☆ Analyzing and Improving the Training Dynamics of Diffusion Models
Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
♻ ☆ Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer CVPR 2024
Despite the impressive generative capabilities of diffusion models, existing diffusion model-based style transfer methods require inference-stage optimization (e.g. fine-tuning or textual inversion of style) which is time-consuming, or fails to leverage the generative ability of large-scale diffusion models. To address these issues, we introduce a novel artistic style transfer method based on a pre-trained large-scale diffusion model without any optimization. Specifically, we manipulate the features of self-attention layers as the way the cross-attention mechanism works; in the generation process, substituting the key and value of content with those of style image. This approach provides several desirable characteristics for style transfer including 1) preservation of content by transferring similar styles into similar image patches and 2) transfer of style based on similarity of local texture (e.g. edge) between content and style images. Furthermore, we introduce query preservation and attention temperature scaling to mitigate the issue of disruption of original content, and initial latent Adaptive Instance Normalization (AdaIN) to deal with the disharmonious color (failure to transfer the colors of style). Our experimental results demonstrate that our proposed method surpasses state-of-the-art methods in both conventional and diffusion-based style transfer baselines.
comment: Accepted to CVPR 2024. Project page: https://jiwoogit.github.io/StyleID_site
♻ ☆ Joint Person Identity, Gender and Age Estimation from Hand Images using Deep Multi-Task Representation Learning
In this paper, we propose a multi-task representation learning framework to jointly estimate the identity, gender and age of individuals from their hand images for the purpose of criminal investigations since the hand images are often the only available information in cases of serious crime such as sexual abuse. We investigate different up-to-date deep learning architectures and compare their performance for joint estimation of identity, gender and age from hand images of perpetrators of serious crime. To simplify the age prediction, we create age groups for the age estimation. We make extensive evaluations and comparisons of both convolution-based and transformer-based deep learning architectures on a publicly available 11k hands dataset. Our experimental analysis shows that it is possible to efficiently estimate not only identity but also other attributes such as gender and age of suspects jointly from hand images for criminal investigations, which is crucial in assisting international police forces in the court to identify and convict abusers.
comment: arXiv admin note: text overlap with arXiv:2209.04821
♻ ☆ iComMa: Inverting 3D Gaussian Splatting for Camera Pose Estimation via Comparing and Matching
We present a method named iComMa to address the 6D camera pose estimation problem in computer vision. Conventional pose estimation methods typically rely on the target's CAD model or necessitate specific network training tailored to particular object classes. Some existing methods have achieved promising results in mesh-free object and scene pose estimation by inverting the Neural Radiance Fields (NeRF). However, they still struggle with adverse initializations such as large rotations and translations. To address this issue, we propose an efficient method for accurate camera pose estimation by inverting 3D Gaussian Splatting (3DGS). Specifically, a gradient-based differentiable framework optimizes camera pose by minimizing the residual between the query image and the rendered image, requiring no training. An end-to-end matching module is designed to enhance the model's robustness against adverse initializations, while minimizing pixel-level comparing loss aids in precise pose estimation. Experimental results on synthetic and complex real-world data demonstrate the effectiveness of the proposed approach in challenging conditions and the accuracy of camera pose estimation.
♻ ☆ IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting
Video Action Counting (VAC) is crucial in analyzing sports, fitness, and everyday activities by quantifying repetitive actions in videos. However, traditional VAC methods have overlooked the complexity of action repetitions, such as interruptions and the variability in cycle duration. Our research addresses the shortfall by introducing a novel approach to VAC, called Irregular Video Action Counting (IVAC). IVAC prioritizes modeling irregular repetition patterns in videos, which we define through two primary aspects: Inter-cycle Consistency and Cycle-interval Inconsistency. Inter-cycle Consistency ensures homogeneity in the spatial-temporal representations of cycle segments, signifying action uniformity within cycles. Cycle-interval inconsistency highlights the importance of distinguishing between cycle segments and intervals based on their inherent content differences. To encapsulate these principles, we propose a new methodology that includes consistency and inconsistency modules, supported by a unique pull-push loss (P2L) mechanism. The IVAC-P2L model applies a pull loss to promote coherence among cycle segment features and a push loss to clearly distinguish features of cycle segments from interval segments. Empirical evaluations conducted on the RepCount dataset demonstrate that the IVAC-P2L model sets a new benchmark in VAC task performance. Furthermore, the model demonstrates exceptional adaptability and generalization across various video contents, outperforming existing models on two additional datasets, UCFRep and Countix, without the need for dataset-specific optimization. These results confirm the efficacy of our approach in addressing irregular repetitions in videos and pave the way for further advancements in video analysis and understanding.
comment: Source code: https://github.com/hwang-cs-ime/IVAC-P2L
♻ ☆ Enhanced Face Authentication With Separate Loss Functions
The overall objective of the main project is to propose and develop a system of facial authentication in unlocking phones or applications in phones using facial recognition. The system will include four separate architectures: face detection, face recognition, face spoofing, and classification of closed eyes. In which, we consider the problem of face recognition to be the most important, determining the true identity of the person standing in front of the screen with absolute accuracy is what facial recognition systems need to achieve. Along with the development of the face recognition problem, the problem of the anti-fake face is also gradually becoming popular and equally important. Our goal is to propose and develop two loss functions: LMCot and Double Loss. Then apply them to the face authentication process.
comment: in Vietnamese language
♻ ☆ Impact of Synthetic Images on Morphing Attack Detection Using a Siamese Network
This paper evaluated the impact of synthetic images on Morphing Attack Detection (MAD) using a Siamese network with a semi-hard-loss function. Intra and cross-dataset evaluations were performed to measure synthetic image generalisation capabilities using a cross-dataset for evaluation. Three different pre-trained networks were used as feature extractors from traditional MobileNetV2, MobileNetV3 and EfficientNetB0. Our results show that MAD trained on EfficientNetB0 from FERET, FRGCv2, and FRLL can reach a lower error rate in comparison with SOTA. Conversely, worse performances were reached when the system was trained only with synthetic images. A mixed approach (synthetic + digital) database may help to improve MAD and reduce the error rate. This fact shows that we still need to keep going with our efforts to include synthetic images in the training process.
comment: Arxiv version of CIARP2023 - fixed typo errors
♻ ☆ Immunohistochemistry guided segmentation of benign epithelial cells, in situ lesions, and invasive epithelial cells in breast cancer slides
Digital pathology enables automatic analysis of histopathological sections using artificial intelligence (AI). Automatic evaluation could improve diagnostic efficiency and help find associations between morphological features and clinical outcome. For development of such prediction models, identifying invasive epithelial cells, and separating these from benign epithelial cells and in situ lesions would be the first step. In this study, we aimed to develop an AI model for segmentation of epithelial cells in sections from breast cancer. We generated epithelial ground truth masks by restaining hematoxylin and eosin (HE) sections with cytokeratin (CK) AE1/AE3, and by pathologists' annotations. HE/CK image pairs were used to train a convolutional neural network, and data augmentation was used to make the model more robust. Tissue microarrays (TMAs) from 839 patients, and whole slide images from two patients were used for training and evaluation of the models. The sections were derived from four cohorts of breast cancer patients. TMAs from 21 patients from a fifth cohort was used as a second test set. In quantitative evaluation, a mean Dice score of 0.70, 0.79, and 0.75 for invasive epithelial cells, benign epithelial cells, and in situ lesions, respectively, were achieved. In qualitative scoring (0-5) by pathologists, results were best for all epithelium and invasive epithelium, with scores of 4.7 and 4.4. Scores for benign epithelium and in situ lesions were 3.7 and 2.0. The proposed model segmented epithelial cells in HE stained breast cancer slides well, but further work is needed for accurate division between the classes. Immunohistochemistry, together with pathologists' annotations, enabled the creation of accurate ground truths. The model is made freely available in FastPathology and the code is available at https://github.com/AICAN-Research/breast-epithelium-segmentation
comment: 19 pages, 6 figures. Submitted to a scientific journal
♻ ☆ MoST: Motion Style Transformer between Diverse Action Contents CVPR 2024
While existing motion style transfer methods are effective between two motions with identical content, their performance significantly diminishes when transferring style between motions with different contents. This challenge lies in the lack of clear separation between content and style of a motion. To tackle this challenge, we propose a novel motion style transformer that effectively disentangles style from content and generates a plausible motion with transferred style from a source motion. Our distinctive approach to achieving the goal of disentanglement is twofold: (1) a new architecture for motion style transformer with `part-attentive style modulator across body parts' and `Siamese encoders that encode style and content features separately'; (2) style disentanglement loss. Our method outperforms existing methods and demonstrates exceptionally high quality, particularly in motion pairs with different contents, without the need for heuristic post-processing. Codes are available at https://github.com/Boeun-Kim/MoST.
comment: Accepted by CVPR 2024
♻ ☆ AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning
Continual learning aims to enable a model to incrementally learn knowledge from sequentially arrived data. Previous works adopt the conventional classification architecture, which consists of a feature extractor and a classifier. The feature extractor is shared across sequentially arrived tasks or classes, but one specific group of weights of the classifier corresponding to one new class should be incrementally expanded. Consequently, the parameters of a continual learner gradually increase. Moreover, as the classifier contains all historical arrived classes, a certain size of the memory is usually required to store rehearsal data to mitigate classifier bias and catastrophic forgetting. In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks. Specifically, AttriCLIP is built upon the pre-trained visual-language model CLIP. Its image encoder and text encoder are fixed to extract features from both images and text. Text consists of a category name and a fixed number of learnable parameters which are selected from our designed attribute word bank and serve as attributes. As we compute the visual and textual similarity for classification, AttriCLIP is a non-incremental learner. The attribute prompts, which encode the common knowledge useful for classification, can effectively mitigate the catastrophic forgetting and avoid constructing a replay memory. We evaluate our AttriCLIP and compare it with CLIP-based and previous state-of-the-art continual learning methods in realistic settings with domain-shift and long-sequence learning. The results show that our method performs favorably against previous state-of-the-arts. The implementation code can be available at https://github.com/bhrqw/AttriCLIP.
♻ ☆ Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation
In this paper, we introduce a novel training method for making any monocular depth network learn absolute scale and estimate metric road-scene depth just from regular training data, i.e., driving videos. We refer to this training framework as StableCamH. The key idea is to leverage cars found on the road as sources of scale supervision but to incorporate them in the training robustly. StableCamH detects and estimates the sizes of cars in the frame and aggregates scale information extracted from them into a camera height estimate whose consistency across the entire video sequence is enforced as scale supervision. This realizes robust unsupervised training of any, otherwise scale-oblivious, monocular depth network to become not only scale-aware but also metric-accurate without the need for auxiliary sensors and extra supervision. Extensive experiments on the KITTI and Cityscapes datasets show the effectiveness of StableCamH and its state-of-the-art accuracy compared with related methods. We also show that StableCamH enables training on mixed datasets of different camera heights, which leads to larger-scale training and thus higher generalization. Metric depth reconstruction is essential in any road-scene visual modeling, and StableCamH democratizes its deployment by establishing the means to train any model as a metric depth estimator.
♻ ☆ End-to-end Learned Visual Odometry with Events and Frames
Visual Odometry (VO) is crucial for autonomous robotic navigation, especially in GPS-denied environments like planetary terrains. To improve robustness, recent model-based VO systems have begun combining standard and event-based cameras. Event cameras excel in low-light and high-speed motion, while standard cameras provide dense and easier-to-track features, even in low-textured areas. However, the field of image- and event-based VO still predominantly relies on model-based methods and is yet to fully integrate recent image-only advancements leveraging end-to-end learning-based architectures. Seamlessly integrating the two modalities remains challenging due to their different nature, one asynchronous, the other not, limiting the potential for a more effective image- and event-based VO. We introduce RAMP-VO, the first end-to-end learned image- and event-based VO system. It leverages novel Recurrent, Asynchronous, and Massively Parallel (RAMP) encoders capable of fusing asynchronous events with image data, providing 8x faster inference and 33% more accurate predictions than existing solutions. Despite being trained only in simulation, RAMP-VO outperforms image- and event-based methods by 46% and 60%, respectively, on traditional, real-world benchmarks as well as newly introduced Apollo and Malapert landing sequences, paving the way for robust and asynchronous VO in space.
comment: 8 pages, 5 figures, 4 tables
♻ ☆ QUASAR: QUality and Aesthetics Scoring with Advanced Representations
This paper introduces a new data-driven, non-parametric method for image quality and aesthetics assessment, surpassing existing approaches and requiring no prompt engineering or fine-tuning. We eliminate the need for expressive textual embeddings by proposing efficient image anchors in the data. Through extensive evaluations of 7 state-of-the-art self-supervised models, our method demonstrates superior performance and robustness across various datasets and benchmarks. Notably, it achieves high agreement with human assessments even with limited data and shows high robustness to the nature of data and their pre-processing pipeline. Our contributions offer a streamlined solution for assessment of images while providing insights into the perception of visual information.
♻ ☆ A Hybrid Transformer-Sequencer approach for Age and Gender classification from in-wild facial images
The advancements in computer vision and image processing techniques have led to emergence of new application in the domain of visual surveillance, targeted advertisement, content-based searching, and human-computer interaction etc. Out of the various techniques in computer vision, face analysis, in particular, has gained much attention. Several previous studies have tried to explore different applications of facial feature processing for a variety of tasks, including age and gender classification. However, despite several previous studies having explored the problem, the age and gender classification of in-wild human faces is still far from the achieving the desired levels of accuracy required for real-world applications. This paper, therefore, attempts to bridge this gap by proposing a hybrid model that combines self-attention and BiLSTM approaches for age and gender classification problems. The proposed models performance is compared with several state-of-the-art model proposed so far. An improvement of approximately 10percent and 6percent over the state-of-the-art implementations for age and gender classification, respectively, are noted for the proposed model. The proposed model is thus found to achieve superior performance and is found to provide a more generalized learning. The model can, therefore, be applied as a core classification component in various image processing and computer vision problems.
comment: 22 pages
♻ ☆ DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes
We present DrivingGaussian, an efficient and effective framework for surrounding dynamic autonomous driving scenes. For complex scenes with moving objects, we first sequentially and progressively model the static background of the entire scene with incremental static 3D Gaussians. We then leverage a composite dynamic Gaussian graph to handle multiple moving objects, individually reconstructing each object and restoring their accurate positions and occlusion relationships within the scene. We further use a LiDAR prior for Gaussian Splatting to reconstruct scenes with greater details and maintain panoramic consistency. DrivingGaussian outperforms existing methods in dynamic driving scene reconstruction and enables photorealistic surround-view synthesis with high-fidelity and multi-camera consistency. Our project page is at: https://github.com/VDIGPKU/DrivingGaussian.
♻ ☆ AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models
Existing customization methods require access to multiple reference examples to align pre-trained diffusion probabilistic models (DPMs) with user-provided concepts. This paper aims to address the challenge of DPM customization when the only available supervision is a differentiable metric defined on the generated contents. Since the sampling procedure of DPMs involves recursive calls to the denoising UNet, na\"ive gradient backpropagation requires storing the intermediate states of all iterations, resulting in extremely high memory consumption. To overcome this issue, we propose a novel method AdjointDPM, which first generates new samples from diffusion models by solving the corresponding probability-flow ODEs. It then uses the adjoint sensitivity method to backpropagate the gradients of the loss to the models' parameters (including conditioning signals, network weights, and initial noises) by solving another augmented ODE. To reduce numerical errors in both the forward generation and gradient backpropagation processes, we further reparameterize the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. Finally, we demonstrate the effectiveness of AdjointDPM on three interesting tasks: converting visual effects into identification text embeddings, finetuning DPMs for specific types of stylization, and optimizing initial noise to generate adversarial samples for security auditing.
♻ ☆ KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation CVPR 2024
In this paper, we present KP-RED, a unified KeyPoint-driven REtrieval and Deformation framework that takes object scans as input and jointly retrieves and deforms the most geometrically similar CAD models from a pre-processed database to tightly match the target. Unlike existing dense matching based methods that typically struggle with noisy partial scans, we propose to leverage category-consistent sparse keypoints to naturally handle both full and partial object scans. Specifically, we first employ a lightweight retrieval module to establish a keypoint-based embedding space, measuring the similarity among objects by dynamically aggregating deformation-aware local-global features around extracted keypoints. Objects that are close in the embedding space are considered similar in geometry. Then we introduce the neural cage-based deformation module that estimates the influence vector of each keypoint upon cage vertices inside its local support region to control the deformation of the retrieved shape. Extensive experiments on the synthetic dataset PartNet and the real-world dataset Scan2CAD demonstrate that KP-RED surpasses existing state-of-the-art approaches by a large margin. Codes and trained models will be released in https://github.com/lolrudy/KP-RED.
comment: Accepted by CVPR 2024
♻ ☆ Learning to Produce Semi-dense Correspondences for Visual Localization CVPR 2024
This study addresses the challenge of performing visual localization in demanding conditions such as night-time scenarios, adverse weather, and seasonal changes. While many prior studies have focused on improving image-matching performance to facilitate reliable dense keypoint matching between images, existing methods often heavily rely on predefined feature points on a reconstructed 3D model. Consequently, they tend to overlook unobserved keypoints during the matching process. Therefore, dense keypoint matches are not fully exploited, leading to a notable reduction in accuracy, particularly in noisy scenes. To tackle this issue, we propose a novel localization method that extracts reliable semi-dense 2D-3D matching points based on dense keypoint matches. This approach involves regressing semi-dense 2D keypoints into 3D scene coordinates using a point inference network. The network utilizes both geometric and visual cues to effectively infer 3D coordinates for unobserved keypoints from the observed ones. The abundance of matching information significantly enhances the accuracy of camera pose estimation, even in scenarios involving noisy or sparse 3D models. Comprehensive evaluations demonstrate that the proposed method outperforms other methods in challenging scenes and achieves competitive results in large-scale visual localization benchmarks. The code will be available.
comment: Accepted at CVPR 2024
♻ ☆ Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
Instruction tuning data is essential for training the Multimodal Large Language Models (MLLMs). However, the creation of high-quality instruction tuning data presents significant challenges. Prior methods that depended on GPT-4 for data generation were not only costly but also lacked satisfactory performance in complex tasks (i.e., grounding-based reasoning tasks). To address these issues, we developed an innovative data generation pipeline, Genixer, to generate various high-quality instruction tuning data, including nine representative tasks, e.g., Common VQA, REC, REG, and PointQ. Specifically, Genixer provides a unified solution with four key steps for alleviating the difficulty of data generation: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLM, and (iv) data generation and filtering. Subsequently, the superior qualitative results of our Genixer demonstrate that current MLLMs have a strong potential to evolve into powerful data generators. Additionally, to validate the efficacy of generated data quantitatively, we add the instruction tuning data produced by Genixer into the training of two representative MLLMs and observe the consistent improvements on various VQA tasks and multimodal benchmarks.
comment: Technical report
♻ ☆ Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks
Current RGBT tracking research relies on the complete multi-modal input, but modal information might miss due to some factors such as thermal sensor self-calibration and data transmission error, called modality-missing challenge in this work. To address this challenge, we propose a novel invertible prompt learning approach, which integrates the content-preserving prompts into a well-trained tracking model to adapt to various modality-missing scenarios, for robust RGBT tracking. Given one modality-missing scenario, we propose to utilize the available modality to generate the prompt of the missing modality to adapt to RGBT tracking model. However, the cross-modality gap between available and missing modalities usually causes semantic distortion and information loss in prompt generation. To handle this issue, we design the invertible prompter by incorporating the full reconstruction of the input available modality from the generated prompt. To provide a comprehensive evaluation platform, we construct several high-quality benchmark datasets, in which various modality-missing scenarios are considered to simulate real-world challenges. Extensive experiments on three modality-missing benchmark datasets show that our method achieves significant performance improvements compared with state-of-the-art methods. We have released the code and simulation datasets at: \href{https://github.com/Alexadlu/Modality-missing-RGBT-Tracking.git}{https://github.com/Alexadlu/Modality-missing-RGBT-Tracking.git}.
♻ ☆ EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation
Semantic segmentation has innately relied on extensive pixel-level annotated data, leading to the emergence of unsupervised methodologies. Among them, leveraging self-supervised Vision Transformers for unsupervised semantic segmentation (USS) has been making steady progress with expressive deep features. Yet, for semantically segmenting images with complex objects, a predominant challenge remains: the lack of explicit object-level semantic encoding in patch-level features. This technical limitation often leads to inadequate segmentation of complex objects with diverse structures. To address this gap, we present a novel approach, EAGLE, which emphasizes object-centric representation learning for unsupervised semantic segmentation. Specifically, we introduce EiCue, a spectral technique providing semantic and structural cues through an eigenbasis derived from the semantic similarity matrix of deep image features and color affinity from an image. Further, by incorporating our object-centric contrastive loss with EiCue, we guide our model to learn object-level representations with intra- and inter-image object-feature consistency, thereby enhancing semantic accuracy. Extensive experiments on COCO-Stuff, Cityscapes, and Potsdam-3 datasets demonstrate the state-of-the-art USS results of EAGLE with accurate and consistent semantic segmentation across complex scenes.
♻ ☆ Towards Effective Multiple-in-One Image Restoration: A Sequential and Prompt Learning Strategy
While single task image restoration (IR) has achieved significant successes, it remains a challenging issue to train a single model which can tackle multiple IR tasks. In this work, we investigate in-depth the multiple-in-one (MiO) IR problem, which comprises seven popular IR tasks. We point out that MiO IR faces two pivotal challenges: the optimization of diverse objectives and the adaptation to multiple tasks. To tackle these challenges, we present two simple yet effective strategies. The first strategy, referred to as sequential learning, attempts to address how to optimize the diverse objectives, which guides the network to incrementally learn individual IR tasks in a sequential manner rather than mixing them together. The second strategy, i.e., prompt learning, attempts to address how to adapt to the different IR tasks, which assists the network to understand the specific task and improves the generalization ability. By evaluating on 19 test sets, we demonstrate that the sequential and prompt learning strategies can significantly enhance the MiO performance of commonly used CNN and Transformer backbones. Our experiments also reveal that the two strategies can supplement each other to learn better degradation representations and enhance the model robustness. It is expected that our proposed MiO IR formulation and strategies could facilitate the research on how to train IR models with higher generalization capabilities.
♻ ☆ Posterior Distillation Sampling
We introduce Posterior Distillation Sampling (PDS), a novel optimization method for parametric image editing based on diffusion models. Existing optimization-based methods, which leverage the powerful 2D prior of diffusion models to handle various parametric images, have mainly focused on generation. Unlike generation, editing requires a balance between conforming to the target attribute and preserving the identity of the source content. Recent 2D image editing methods have achieved this balance by leveraging the stochastic latent encoded in the generative process of diffusion models. To extend the editing capabilities of diffusion models shown in pixel space to parameter space, we reformulate the 2D image editing method into an optimization form named PDS. PDS matches the stochastic latents of the source and the target, enabling the sampling of targets in diverse parameter spaces that align with a desired attribute while maintaining the source's identity. We demonstrate that this optimization resembles running a generative process with the target attribute, but aligning this process with the trajectory of the source's generative process. Extensive editing results in Neural Radiance Fields and Scalable Vector Graphics representations demonstrate that PDS is capable of sampling targets to fulfill the aforementioned balance across various parameter spaces.
comment: Project page: https://posterior-distillation-sampling.github.io/
♻ ☆ StyleHumanCLIP: Text-guided Garment Manipulation for StyleGAN-Human
This paper tackles text-guided control of StyleGAN for editing garments in full-body human images. Existing StyleGAN-based methods suffer from handling the rich diversity of garments and body shapes and poses. We propose a framework for text-guided full-body human image synthesis via an attention-based latent code mapper, which enables more disentangled control of StyleGAN than existing mappers. Our latent code mapper adopts an attention mechanism that adaptively manipulates individual latent codes on different StyleGAN layers under text guidance. In addition, we introduce feature-space masking at inference time to avoid unwanted changes caused by text inputs. Our quantitative and qualitative evaluations reveal that our method can control generated images more faithfully to given texts than existing methods.
comment: VISIAPP 2024, project page: https://www.cgg.cs.tsukuba.ac.jp/~yoshikawa/pub/style_human_clip/
♻ ☆ MEDBind: Unifying Language and Multimodal Medical Data Embeddings
Medical vision-language pretraining models (VLPM) have achieved remarkable progress in fusing chest X-rays (CXR) with clinical texts, introducing image-text data binding approaches that enable zero-shot learning and downstream clinical tasks. However, the current landscape lacks the holistic integration of additional medical modalities, such as electrocardiograms (ECG). We present MEDBind (Medical Electronic patient recorD), which learns joint embeddings across CXR, ECG, and medical text. Using text data as the central anchor, MEDBind features tri-modality binding, delivering competitive performance in top-K retrieval, zero-shot, and few-shot benchmarks against established VLPM, and the ability for CXR-to-ECG zero-shot classification and retrieval. This seamless integration is achieved through combination of contrastive loss on modality-text pairs with our proposed contrastive loss function, Edge-Modality Contrastive Loss, fostering a cohesive embedding space for CXR, ECG, and text. Finally, we demonstrate that MEDBind can improve downstream tasks by directly integrating CXR and ECG embeddings into a large-language model for multimodal prompt tuning.
♻ ☆ SALAD: Part-Level Latent Diffusion for 3D Shape Generation and Manipulation
We present a cascaded diffusion model based on a part-level implicit 3D representation. Our model achieves state-of-the-art generation quality and also enables part-level shape editing and manipulation without any additional training in conditional setup. Diffusion models have demonstrated impressive capabilities in data generation as well as zero-shot completion and editing via a guided reverse process. Recent research on 3D diffusion models has focused on improving their generation capabilities with various data representations, while the absence of structural information has limited their capability in completion and editing tasks. We thus propose our novel diffusion model using a part-level implicit representation. To effectively learn diffusion with high-dimensional embedding vectors of parts, we propose a cascaded framework, learning diffusion first on a low-dimensional subspace encoding extrinsic parameters of parts and then on the other high-dimensional subspace encoding intrinsic attributes. In the experiments, we demonstrate the outperformance of our method compared with the previous ones both in generation and part-level completion and manipulation tasks.
comment: Project page: https://salad3d.github.io
Information Retrieval
☆ Leveraging High-Resolution Features for Improved Deep Hashing-based Image Retrieval
Deep hashing techniques have emerged as the predominant approach for efficient image retrieval. Traditionally, these methods utilize pre-trained convolutional neural networks (CNNs) such as AlexNet and VGG-16 as feature extractors. However, the increasing complexity of datasets poses challenges for these backbone architectures in capturing meaningful features essential for effective image retrieval. In this study, we explore the efficacy of employing high-resolution features learned through state-of-the-art techniques for image retrieval tasks. Specifically, we propose a novel methodology that utilizes High-Resolution Networks (HRNets) as the backbone for the deep hashing task, termed High-Resolution Hashing Network (HHNet). Our approach demonstrates superior performance compared to existing methods across all tested benchmark datasets, including CIFAR-10, NUS-WIDE, MS COCO, and ImageNet. This performance improvement is more pronounced for complex datasets, which highlights the need to learn high-resolution features for intricate image retrieval tasks. Furthermore, we conduct a comprehensive analysis of different HRNet configurations and provide insights into the optimal architecture for the deep hashing task
☆ No more optimization rules: LLM-enabled policy-based multi-modal query optimizer (version 1)
Large language model (LLM) has marked a pivotal moment in the field of machine learning and deep learning. Recently its capability for query planning has been investigated, including both single-modal and multi-modal queries. However, there is no work on the query optimization capability of LLM. As a critical (or could even be the most important) step that significantly impacts the execution performance of the query plan, such analysis and attempts should not be missed. From another aspect, existing query optimizers are usually rule-based or rule-based + cost-based, i.e., they are dependent on manually created rules to complete the query plan rewrite/transformation. Given the fact that modern optimizers include hundreds to thousands of rules, designing a multi-modal query optimizer following a similar way is significantly time-consuming since we will have to enumerate as many multi-modal optimization rules as possible, which has not been well addressed today. In this paper, we investigate the query optimization ability of LLM and use LLM to design LaPuda, a novel LLM and Policy based multi-modal query optimizer. Instead of enumerating specific and detailed rules, LaPuda only needs a few abstract policies to guide LLM in the optimization, by which much time and human effort are saved. Furthermore, to prevent LLM from making mistakes or negative optimization, we borrow the idea of gradient descent and propose a guided cost descent (GCD) algorithm to perform the optimization, such that the optimization can be kept in the correct direction. In our evaluation, our methods consistently outperform the baselines in most cases. For example, the optimized plans generated by our methods result in 1~3x higher execution speed than those by the baselines.
comment: Yifan and Haodi contribute equally to the work
☆ A Large Language Model Enhanced Sequential Recommender for Joint Video and Comment Recommendation
In online video platforms, reading or writing comments on interesting videos has become an essential part of the video watching experience. However, existing video recommender systems mainly model users' interaction behaviors with videos, lacking consideration of comments in user behavior modeling. In this paper, we propose a novel recommendation approach called LSVCR by leveraging user interaction histories with both videos and comments, so as to jointly conduct personalized video and comment recommendation. Specifically, our approach consists of two key components, namely sequential recommendation (SR) model and supplemental large language model (LLM) recommender. The SR model serves as the primary recommendation backbone (retained in deployment) of our approach, allowing for efficient user preference modeling. Meanwhile, we leverage the LLM recommender as a supplemental component (discarded in deployment) to better capture underlying user preferences from heterogeneous interaction behaviors. In order to integrate the merits of the SR model and the supplemental LLM recommender, we design a twostage training paradigm. The first stage is personalized preference alignment, which aims to align the preference representations from both components, thereby enhancing the semantics of the SR model. The second stage is recommendation-oriented fine-tuning, in which the alignment-enhanced SR model is fine-tuned according to specific objectives. Extensive experiments in both video and comment recommendation tasks demonstrate the effectiveness of LSVCR. Additionally, online A/B testing on the KuaiShou platform verifies the actual benefits brought by our approach. In particular, we achieve a significant overall gain of 4.13% in comment watch time.
☆ A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels
Cross-modal retrieval (CMR) aims to establish interaction between different modalities, among which supervised CMR is emerging due to its flexibility in learning semantic category discrimination. Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to the well-annotated data. However, even for unimodal data, precise annotation is expensive and time-consuming, and it becomes more challenging with the multimodal scenario. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels would bring two key challenges -- enforcing the multimodal samples to \emph{align incorrect semantics} and \emph{widen the heterogeneous gap}, resulting in poor retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a Unified framework based on Optimal Transport (OT) for Robust Cross-modal Retrieval. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide precise transport cost. Second, to narrow the discrepancy in multi-modal data, an OT-based relation alignment is proposed to infer the semantic-level cross-modal matching. Both of these two components leverage the inherent correlation among multi-modal data to facilitate effective cost function. The experiments on three widely-used cross-modal retrieval datasets demonstrate that our UOT-RCL surpasses the state-of-the-art approaches and significantly improves the robustness against noisy labels.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ DESIRE-ME: Domain-Enhanced Supervised Information REtrieval using Mixture-of-Experts ECIR 2024
Open-domain question answering requires retrieval systems able to cope with the diverse and varied nature of questions, providing accurate answers across a broad spectrum of query types and topics. To deal with such topic heterogeneity through a unique model, we propose DESIRE-ME, a neural information retrieval model that leverages the Mixture-of-Experts framework to combine multiple specialized neural models. We rely on Wikipedia data to train an effective neural gating mechanism that classifies the incoming query and that weighs the predictions of the different domain-specific experts correspondingly. This allows DESIRE-ME to specialize adaptively in multiple domains. Through extensive experiments on publicly available datasets, we show that our proposal can effectively generalize domain-enhanced neural models. DESIRE-ME excels in handling open-domain questions adaptively, boosting by up to 12% in NDCG@10 and 22% in P@1, the underlying state-of-the-art dense retrieval model.
comment: Accepted at the 46th European Conference on Information Retrieval (ECIR 2024)
☆ USE: Dynamic User Modeling with Stateful Sequence Models
User embeddings play a crucial role in user engagement forecasting and personalized services. Recent advances in sequence modeling have sparked interest in learning user embeddings from behavioral data. Yet behavior-based user embedding learning faces the unique challenge of dynamic user modeling. As users continuously interact with the apps, user embeddings should be periodically updated to account for users' recent and long-term behavior patterns. Existing methods highly rely on stateless sequence models that lack memory of historical behavior. They have to either discard historical data and use only the most recent data or reprocess the old and new data jointly. Both cases incur substantial computational overhead. To address this limitation, we introduce User Stateful Embedding (USE). USE generates user embeddings and reflects users' evolving behaviors without the need for exhaustive reprocessing by storing previous model states and revisiting them in the future. Furthermore, we introduce a novel training objective named future W-behavior prediction to transcend the limitations of next-token prediction by forecasting a broader horizon of upcoming user behaviors. By combining it with the Same User Prediction, a contrastive learning-based objective that predicts whether different segments of behavior sequences belong to the same user, we further improve the embeddings' distinctiveness and representativeness. We conducted experiments on 8 downstream tasks using Snapchat users' behavioral logs in both static (i.e., fixed user behavior sequences) and dynamic (i.e., periodically updated user behavior sequences) settings. We demonstrate USE's superior performance over established baselines. The results underscore USE's effectiveness and efficiency in integrating historical and recent user behavior sequences into user embeddings in dynamic user modeling.
☆ Harnessing Large Language Models for Text-Rich Sequential Recommendation
Recent advances in Large Language Models (LLMs) have been changing the paradigm of Recommender Systems (RS). However, when items in the recommendation scenarios contain rich textual information, such as product descriptions in online shopping or news headlines on social media, LLMs require longer texts to comprehensively depict the historical user behavior sequence. This poses significant challenges to LLM-based recommenders, such as over-length limitations, extensive time and space overheads, and suboptimal model performance. To this end, in this paper, we design a novel framework for harnessing Large Language Models for Text-Rich Sequential Recommendation (LLM-TRSR). Specifically, we first propose to segment the user historical behaviors and subsequently employ an LLM-based summarizer for summarizing these user behavior blocks. Particularly, drawing inspiration from the successful application of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) models in user modeling, we introduce two unique summarization techniques in this paper, respectively hierarchical summarization and recurrent summarization. Then, we construct a prompt text encompassing the user preference summary, recent user interactions, and candidate item information into an LLM-based recommender, which is subsequently fine-tuned using Supervised Fine-Tuning (SFT) techniques to yield our final recommendation model. We also use Low-Rank Adaptation (LoRA) for Parameter-Efficient Fine-Tuning (PEFT). We conduct experiments on two public datasets, and the results clearly demonstrate the effectiveness of our approach.
☆ Flickr30K-CFQ: A Compact and Fragmented Query Dataset for Text-image Retrieval
With the explosive growth of multi-modal information on the Internet, unimodal search cannot satisfy the requirement of Internet applications. Text-image retrieval research is needed to realize high-quality and efficient retrieval between different modalities. Existing text-image retrieval research is mostly based on general vision-language datasets (e.g. MS-COCO, Flickr30K), in which the query utterance is rigid and unnatural (i.e. verbosity and formality). To overcome the shortcoming, we construct a new Compact and Fragmented Query challenge dataset (named Flickr30K-CFQ) to model text-image retrieval task considering multiple query content and style, including compact and fine-grained entity-relation corpus. We propose a novel query-enhanced text-image retrieval method using prompt engineering based on LLM. Experiments show that our proposed Flickr30-CFQ reveals the insufficiency of existing vision-language datasets in realistic text-image tasks. Our LLM-based Query-enhanced method applied on different existing text-image retrieval models improves query understanding performance both on public dataset and our challenge set Flickr30-CFQ with over 0.9% and 2.4% respectively. Our project can be available anonymously in https://sites.google.com/view/Flickr30K-cfq.
☆ A Semantic Search Engine for Mathlib4
The interactive theorem prover, Lean, enables the verification of formal mathematical proofs and is backed by an expanding community. Central to this ecosystem is its mathematical library, mathlib4, which lays the groundwork for the formalization of an expanding range of mathematical theories. However, searching for theorems in mathlib4 can be challenging. To successfully search in mathlib4, users often need to be familiar with its naming conventions or documentation strings. Therefore, creating a semantic search engine that can be used easily by individuals with varying familiarity with mathlib4 is very important. In this paper, we present a semantic search engine for mathlib4 that accepts informal queries and finds the relevant theorems. We also establish a benchmark for assessing the performance of various search engines for mathlib4.
☆ An Analysis on Matching Mechanisms and Token Pruning for Late-interaction Models
With the development of pre-trained language models, the dense retrieval models have become promising alternatives to the traditional retrieval models that rely on exact match and sparse bag-of-words representations. Different from most dense retrieval models using a bi-encoder to encode each query or document into a dense vector, the recently proposed late-interaction multi-vector models (i.e., ColBERT and COIL) achieve state-of-the-art retrieval effectiveness by using all token embeddings to represent documents and queries and modeling their relevance with a sum-of-max operation. However, these fine-grained representations may cause unacceptable storage overhead for practical search systems. In this study, we systematically analyze the matching mechanism of these late-interaction models and show that the sum-of-max operation heavily relies on the co-occurrence signals and some important words in the document. Based on these findings, we then propose several simple document pruning methods to reduce the storage overhead and compare the effectiveness of different pruning methods on different late-interaction models. We also leverage query pruning methods to further reduce the retrieval latency. We conduct extensive experiments on both in-domain and out-domain datasets and show that some of the used pruning methods can significantly improve the efficiency of these late-interaction models without substantially hurting their retrieval effectiveness.
comment: Accepted by ACM Transactions on Information Systems
☆ Improving Legal Case Retrieval with Brain Signals
The tasks of legal case retrieval have received growing attention from the IR community in the last decade. Relevance feedback techniques with implicit user feedback (e.g., clicks) have been demonstrated to be effective in traditional search tasks (e.g., Web search). In legal case retrieval, however, collecting relevance feedback faces a couple of challenges that are difficult to resolve under existing feedback paradigms. First, legal case retrieval is a complex task as users often need to understand the relationship between legal cases in detail to correctly judge their relevance. Traditional feedback signal such as clicks is too coarse to use as they do not reflect any fine-grained relevance information. Second, legal case documents are usually long, users often need even tens of minutes to read and understand them. Simple behavior signal such as clicks and eye-tracking fixations can hardly be useful when users almost click and examine every part of the document. In this paper, we explore the possibility of solving the feedback problem in legal case retrieval with brain signal. Recent advances in brain signal processing have shown that human emotional can be collected in fine grains through Brain-Machine Interfaces (BMI) without interrupting the users in their tasks. Therefore, we propose a framework for legal case retrieval that uses EEG signal to optimize retrieval results. We collected and create a legal case retrieval dataset with users EEG signal and propose several methods to extract effective EEG features for relevance feedback. Our proposed features achieve a 71% accuracy for feedback prediction with an SVM-RFE model, and our proposed ranking method that takes into account the diverse needs of users can significantly improve user satisfaction for legal case retrieval. Experiment results show that re-ranked result list make user more satisfied.
comment: 11pages, 8 figures
♻ ☆ Investigating the Effects of Sparse Attention on Cross-Encoders ECIR'24
Cross-encoders are effective passage and document re-rankers but less efficient than other neural or classic retrieval models. A few previous studies have applied windowed self-attention to make cross-encoders more efficient. However, these studies did not investigate the potential and limits of different attention patterns or window sizes. We close this gap and systematically analyze how token interactions can be reduced without harming the re-ranking effectiveness. Experimenting with asymmetric attention and different window sizes, we find that the query tokens do not need to attend to the passage or document tokens for effective re-ranking and that very small window sizes suffice. In our experiments, even windows of 4 tokens still yield effectiveness on par with previous cross-encoders while reducing the memory requirements by at least 22% / 59% and being 1% / 43% faster at inference time for passages / documents.
comment: Accepted at ECIR'24
♻ ☆ WaZI: A Learned and Workload-aware Z-Index EDBT 2024
Learned indexes fit machine learning (ML) models to the data and use them to make query operations more time and space-efficient. Recent works propose using learned spatial indexes to improve spatial query performance by optimizing the storage layout or internal search structures according to the data distribution. However, only a few learned indexes exploit the query workload distribution to enhance their performance. In addition, building and updating learned spatial indexes are often costly on large datasets due to the inefficiency of (re)training ML models. In this paper, we present WaZI, a learned and workload-aware variant of the Z-index, which jointly optimizes the storage layout and search structures, as a viable solution for the above challenges of spatial indexing. Specifically, we first formulate a cost function to measure the performance of a Z-index on a dataset for a range-query workload. Then, we optimize the Z-index structure by minimizing the cost function through adaptive partitioning and ordering for index construction. Moreover, we design a novel page-skipping mechanism to improve the query performance of WaZI by reducing access to irrelevant data pages. Our extensive experiments show that the WaZI index improves range query time by 40% on average over the baselines while always performing better or comparably to state-of-the-art spatial indexes. Additionally, it also maintains good point query performance. Generally, WaZI provides favorable tradeoffs among query latency, construction time, and index size.
comment: Camera-ready version accepted to EDBT 2024
♻ ☆ ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems
Deep Recommender Systems (DRS) are increasingly dependent on a large number of feature fields for more precise recommendations. Effective feature selection methods are consequently becoming critical for further enhancing the accuracy and optimizing storage efficiencies to align with the deployment demands. This research area, particularly in the context of DRS, is nascent and faces three core challenges. Firstly, variant experimental setups across research papers often yield unfair comparisons, obscuring practical insights. Secondly, the existing literature's lack of detailed analysis on selection attributes, based on large-scale datasets and a thorough comparison among selection techniques and DRS backbones, restricts the generalizability of findings and impedes deployment on DRS. Lastly, research often focuses on comparing the peak performance achievable by feature selection methods, an approach that is typically computationally infeasible for identifying the optimal hyperparameters and overlooks evaluating the robustness and stability of these methods. To bridge these gaps, this paper presents ERASE, a comprehensive bEnchmaRk for feAture SElection for DRS. ERASE comprises a thorough evaluation of eleven feature selection methods, covering both traditional and deep learning approaches, across four public datasets, private industrial datasets, and a real-world commercial platform, achieving significant enhancement. Our code is available online for ease of reproduction.
♻ ☆ LLatrieval: LLM-Verified Retrieval for Verifiable Generation NAACL 2024
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents, which enables the user to flexibly verify the answer and makes the LLM's output more reliable. Retrieval plays a crucial role in verifiable generation. Specifically, the retrieved documents not only supplement knowledge to help the LLM generate correct answers, but also serve as supporting evidence for the user to verify the LLM's output. However, the widely used retrievers become the bottleneck of the entire pipeline and limit the overall performance. Their capabilities are usually inferior to LLMs since they often have much fewer parameters than the large language model and have not been demonstrated to scale well to the size of LLMs. If the retriever does not correctly find the supporting documents, the LLM can not generate the correct and verifiable answer, which overshadows the LLM's remarkable abilities. To address these limitations, we propose \LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question. Thus, the LLM can iteratively provide feedback to retrieval and facilitate the retrieval result to fully support verifiable generation. Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
comment: Accepted by NAACL 2024 (Main Conference)
♻ ☆ An Aligning and Training Framework for Multimodal Recommendations
With the development of multimedia applications, multimodal recommendations are playing an essential role, as they can leverage rich contexts beyond user interactions. Existing methods mainly regard multimodal information as an auxiliary, using them to help learn ID features; however, there exist semantic gaps among multimodal content features and ID features, for which directly using multimodal information as an auxiliary would lead to misalignment in representations of users and items. In this paper, we first systematically investigate the misalignment issue in multimodal recommendations, and propose a solution named AlignRec. In AlignRec, the recommendation objective is decomposed into three alignments, namely alignment within contents, alignment between content and categorical ID, and alignment between users and items. Each alignment is characterized by a specific objective function and is integrated into our multimodal recommendation framework. To effectively train our AlignRec, we propose starting from pre-training the first alignment to obtain unified multimodal features and subsequently training the following two alignments together with these features as input. As it is essential to analyze whether each multimodal feature helps in training, we design three new classes of metrics to evaluate intermediate performance. Our extensive experiments on three real-world datasets consistently verify the superiority of AlignRec compared to nine baselines. We also find that the multimodal features generated by AlignRec are better than currently used ones, which are to be open-sourced.
comment: 11 pages, add some necessary explanations, revise typos
♻ ☆ I^3 Retriever: Incorporating Implicit Interaction in Pre-trained Language Models for Passage Retrieval
Passage retrieval is a fundamental task in many information systems, such as web search and question answering, where both efficiency and effectiveness are critical concerns. In recent years, neural retrievers based on pre-trained language models (PLM), such as dual-encoders, have achieved huge success. Yet, studies have found that the performance of dual-encoders are often limited due to the neglecting of the interaction information between queries and candidate passages. Therefore, various interaction paradigms have been proposed to improve the performance of vanilla dual-encoders. Particularly, recent state-of-the-art methods often introduce late-interaction during the model inference process. However, such late-interaction based methods usually bring extensive computation and storage cost on large corpus. Despite their effectiveness, the concern of efficiency and space footprint is still an important factor that limits the application of interaction-based neural retrieval models. To tackle this issue, we incorporate implicit interaction into dual-encoders, and propose I^3 retriever. In particular, our implicit interaction paradigm leverages generated pseudo-queries to simulate query-passage interaction, which jointly optimizes with query and passage encoders in an end-to-end manner. It can be fully pre-computed and cached, and its inference process only involves simple dot product operation of the query vector and passage vector, which makes it as efficient as the vanilla dual encoders. We conduct comprehensive experiments on MSMARCO and TREC2019 Deep Learning Datasets, demonstrating the I^3 retriever's superiority in terms of both effectiveness and efficiency. Moreover, the proposed implicit interaction is compatible with special pre-training and knowledge distillation for passage retrieval, which brings a new state-of-the-art performance.
comment: 10 pages
♻ ☆ RecMind: Large Language Model Powered Agent For Recommendation NAACL 2024
While the recommendation system (RS) has advanced significantly through deep learning, current RS approaches usually train and fine-tune models on task-specific datasets, limiting their generalizability to new recommendation tasks and their ability to leverage external knowledge due to model scale and data size constraints. Thus, we designed an LLM-powered autonomous recommender agent, RecMind, which is capable of leveraging external knowledge, utilizing tools with careful planning to provide zero-shot personalized recommendations. We propose a Self-Inspiring algorithm to improve the planning ability. At each intermediate step, the LLM self-inspires to consider all previously explored states to plan for the next step. This mechanism greatly improves the model's ability to comprehend and utilize historical information in planning for recommendation. We evaluate RecMind's performance in various recommendation scenarios. Our experiment shows that RecMind outperforms existing zero/few-shot LLM-based recommendation baseline methods in various tasks and achieves comparable performance to a fully trained recommendation model P5.
comment: Accepted by NAACL 2024 (Findings)
Signal Processing
Overview of Publicly Available Degradation Data Sets for Tasks within Prognostics and Health Management
Central to the efficacy of prognostics and health management methods is the acquisition and analysis of degradation data, which encapsulates the evolving health condition of engineering systems over time. Degradation data serves as a rich source of information, offering invaluable insights into the underlying degradation processes, failure modes, and performance trends of engineering systems. This paper provides an overview of publicly available degradation data sets.
☆ MIMO Channel as a Neural Function: Implicit Neural Representations for Extreme CSI Compression in Massive MIMO Systems
Acquiring and utilizing accurate channel state information (CSI) can significantly improve transmission performance, thereby holding a crucial role in realizing the potential advantages of massive multiple-input multiple-output (MIMO) technology. Current prevailing CSI feedback approaches improve precision by employing advanced deep-learning methods to learn representative CSI features for a subsequent compression process. Diverging from previous works, we treat the CSI compression problem in the context of implicit neural representations. Specifically, each CSI matrix is viewed as a neural function that maps the CSI coordinates (antenna number and subchannel) to the corresponding channel gains. Instead of transmitting the parameters of the implicit neural functions directly, we transmit modulations based on the CSI matrix derived through a meta-learning algorithm. Modulations are then applied to a shared base network to generate the elements of the CSI matrix. Modulations corresponding to the CSI matrix are quantized and entropy-coded to further reduce the communication bandwidth, thus achieving extreme CSI compression ratios. Numerical results show that our proposed approach achieves state-of-the-art performance and showcases flexibility in feedback strategies.
☆ Densify & Conquer: Densified, smaller base-stations can conquer the increasing carbon footprint problem in nextG wireless
Connectivity on-the-go has been one of the most impressive technological achievements in the 2010s decade. However, multiple studies show that this has come at an expense of increased carbon footprint, that also rivals the entire aviation sector's carbon footprint. The two major contributors of this increased footprint are (a) smartphone batteries which affect the embodied footprint and (b) base-stations that occupy ever-increasing energy footprint to provide the last mile wireless connectivity to smartphones. The root-cause of both these turn out to be the same, which is communicating over the last-mile lossy wireless medium. We show in this paper, titled DensQuer, how base-station densification, which is to replace a single larger base-station with multiple smaller ones, reduces the effect of the last-mile wireless, and in effect conquers both these adverse sources of increased carbon footprint. Backed by a open-source ray-tracing computation framework (Sionna), we show how a strategic densification strategy can minimize the number of required smaller base-stations to practically achievable numbers, which lead to about 3x power-savings in the base-station network. Also, DensQuer is able to also reduce the required deployment height of base-stations to as low as 15m, that makes the smaller cells easily deployable on trees/street poles instead of requiring a dedicated tower. Further, by utilizing newly introduced hardware power rails in Google Pixel 7a and above phones, we also show that this strategic densified network leads to reduction in mobile transmit power by 10-15 dB, leading to about 3x reduction in total cellular power consumption, and about 50% increase in smartphone battery life when it communicates data via the cellular network.
comment: 12 pages, 14 figures
☆ Movable Antenna Enabled Interference Network: Joint Antenna Position and Beamforming Design
This paper investigates the utility of movable antenna (MA) assistance for the multiple-input single-output (MISO) interference channel. We exploit an additional design degree of freedom provided by MA to enhance the desired signal and suppress interference so as to reduce the total transmit power of interference network. To this end, we jointly optimize the MA positions and transmit beamforming, subject to the signal-to-interference-plus-noise ratio constraints of users. To address the non-convex optimization problem, we propose an efficient iterative algorithm to alternately optimize the MA positions via successive convex approximation method and the transmit beamforming via second-order cone program approach. Numerical results demonstrate that the proposed MA-enabled MISO interference network outperforms its conventional counterpart without MA, which significantly enhances the capability of inter-cell frequency reuse and reduces the complexity of transmitter design.
☆ BanglaNum -- A Public Dataset for Bengali Digit Recognition from Speech
Automatic speech recognition (ASR) converts the human voice into readily understandable and categorized text or words. Although Bengali is one of the most widely spoken languages in the world, there have been very few studies on Bengali ASR, particularly on Bangladeshi-accented Bengali. In this study, audio recordings of spoken digits (0-9) from university students were used to create a Bengali speech digits dataset that may be employed to train artificial neural networks for voice-based digital input systems. This paper also compares the Bengali digit recognition accuracy of several Convolutional Neural Networks (CNNs) using spectrograms and shows that a test accuracy of 98.23% is achievable using parameter-efficient models such as SqueezeNet on our dataset.
☆ An Extended Kuramoto Model for Frequency and Phase Synchronization in Delay-Free Networks with Finite Number of Agents
Due to its description of a synchronization between oscillators, the Kuramoto model is an ideal choice for a synchronisation algorithm in networked systems. This requires to achieve not only a frequency synchronization but also a phase synchronization - something the standard Kuramoto model can not provide for a finite number of agents. In this case, a remaining phase difference is necessary to offset differences of the natural frequencies. Setting the Kuramoto model into the context of dynamic consensus and making use of the $n$th order discrete average consensus algorithm, this paper extends the standard Kuramoto model in such a way that frequency and phase synchronization are separated. This in turn leads to an algorithm achieve the required frequency and phase synchronization also for a finite number of agents. Simulations show the viability of this extended Kuramoto model.
comment: 8 pages, 6 figures. Shorter version submitted to the 63rd IEEE Conference on Decision and Control, 2024. Funded by BMBF through 6GEM research hub (16KISK038) and by DFG through project JCRS CoMP (504990291)
☆ Massive MIMO CSI Feedback using Channel Prediction: How to Avoid Machine Learning at UE?
In the literature, machine learning (ML) has been implemented at the base station (BS) and user equipment (UE) to improve the precision of downlink channel state information (CSI). However, ML implementation at the UE can be infeasible for various reasons, such as UE power consumption. Motivated by this issue, we propose a CSI learning mechanism at BS, called CSILaBS, to avoid ML at UE. To this end, by exploiting channel predictor (CP) at BS, a light-weight predictor function (PF) is considered for feedback evaluation at the UE. CSILaBS reduces over-the-air feedback overhead, improves CSI quality, and lowers the computation cost of UE. Besides, in a multiuser environment, we propose various mechanisms to select the feedback by exploiting PF while aiming to improve CSI accuracy. We also address various ML-based CPs, such as NeuralProphet (NP), an ML-inspired statistical algorithm. Furthermore, inspired to use a statistical model and ML together, we propose a novel hybrid framework composed of a recurrent neural network and NP, which yields better prediction accuracy than individual models. The performance of CSILaBS is evaluated through an empirical dataset recorded at Nokia Bell-Labs. The outcomes show that ML elimination at UE can retain performance gains, for example, precoding quality.
comment: 14 pages, 11 figures
☆ Superposed IM-OFDM (S-IM-OFDM): An Enhanced OFDM for Integrated Sensing and Communications
Integrated sensing and communications (ISAC) is a critical enabler for emerging 6G applications, and at its core lies in the dual-functional waveform design. While orthogonal frequency division multiplexing (OFDM) has been a popular basic waveform, its primitive version falls short in sensing due to the inherent unregulated auto-correlation properties. Furthermore, the sensitivity to Doppler shift hinders its broader applications in dynamic scenarios. To address these issues, we propose a superposed index-modulated OFDM (S-IM-OFDM). The proposed scheme improves the sensing performance without excess power consumption by translating the energy efficiency of IM-OFDM onto sensing-oriented signals over OFDM. Also, it maintains excellent communication performance in time-varying channels by leveraging the sensed parameters to compensate for Doppler. Compared to conventional OFDM, the proposed S-IM-OFDM waveform exhibits better sensing capabilities and wider applicability in dynamic scenarios. Both theoretical analyses and simulations corroborate its dual benefits.
☆ Modeling RIS from Electromagnetic Principles to Communication Systems--Part II: System-Level Simulation, Ray Tracing, and Measurement
In this paper, we systematically study the electromagnetic (EM) and communication aspects of an RIS through EM simulations, system-level and ray-tracing simulations, and finally measurements. We simulate a nearly perfect, lossless RIS, and a realistic lossy anomalous reflector (AR) in different ray tracers and analyze the large-scale fading of simple RIS-assisted links. We also compare the results with continuous and quantized unit cell reflection phases with one to four-bit resolutions. Finally, we perform over-the-air communication link measurements in an indoor setting with a manufactured sample of a wide-angle AR. The EM, system-level, and ray-tracing simulation results show good agreement with the measurement results. It is proved that the introduced macroscopic model of RIS from the EM aspects is consistent with our proposed communication models, both for an ideal RIS and a realistic AR.
comment: 10 pages, 14 figures
☆ Maximum Likelihood Alternating Summation for Multistatic Angle-based Multitarget Localization
Recent advancements in Wi-Fi sensing have sparked interest in exploiting OFDM modulated communication signals for target detection and tracking. In this study, we address the angle-based localization of multiple targets using a multistatic OFDM radar. While the maximum likelihood approach optimally merges data from each radar pair comprised by the system, it entails a complex multi-dimensional search process. Leveraging pre-estimation of the targets' parameters obtained via the MUSIC algorithm, our method decouples this multi-dimensional search into a single two-dimensional estimator per target. The proposed alternating summation method allows the computation of a combined likelihood map aggregating contributions from each radar pair, enabling target detection via peak selection. Besides reducing computational complexity, the method effectively captures target interactions and accommodates varying radar pair localization abilities. Also, it requires transmitting only the estimated channel covariance matrices of each radar pair to the central processor. Numerical simulations demonstrate superior performance over existing approaches.
comment: 5 Pages, 3 figures, submitted to EUSIPCO2024. arXiv admin note: text overlap with arXiv:2402.13118
☆ Human Detection in Realistic Through-the-Wall Environments using Raw Radar ADC Data and Parametric Neural Networks
The radar signal processing algorithm is one of the core components in through-wall radar human detection technology. Traditional algorithms (e.g., DFT and matched filtering) struggle to adaptively handle low signal-to-noise ratio echo signals in challenging and dynamic real-world through-wall application environments, which becomes a major bottleneck in the system. In this paper, we introduce an end-to-end through-wall radar human detection network (TWP-CNN), which takes raw radar Analog-to-Digital Converter (ADC) signals without any preprocessing as input. We replace the conventional radar signal processing flow with the proposed DFT-based adaptive feature extraction (DAFE) module. This module employs learnable parameterized 3D complex convolution layers to extract superior feature representations from ADC signals, which is beyond the limitation of traditional preprocessing methods. Additionally, by embedding phase information from radar data within the network and employing multi-task learning, a more accurate detection is achieved. Finally, due to the absence of through-wall radar datasets containing raw ADC data, we gathered a realistic through-wall (RTW) dataset using our in-house developed through-wall radar system. We trained and validated our proposed method on this dataset to confirm its effectiveness and superiority in real through-wall detection scenarios.
comment: 11pages,13figures
♻ ☆ Calibration of Deep Learning Classification Models in fNIRS
Functional near-infrared spectroscopy (fNIRS) is a valuable non-invasive tool for monitoring brain activity. The classification of fNIRS data in relation to conscious activity holds significance for advancing our understanding of the brain and facilitating the development of brain-computer interfaces (BCI). Many researchers have turned to deep learning to tackle the classification challenges inherent in fNIRS data due to its strong generalization and robustness. In the application of fNIRS, reliability is really important, and one mathematical formulation of the reliability of confidence is calibration. However, many researchers overlook the important issue of calibration. To address this gap, we propose integrating calibration into fNIRS field and assess the reliability of existing models. Surprisingly, our results indicate poor calibration performance in many proposed models. To advance calibration development in the fNIRS field, we summarize three practical tips. Through this letter, we hope to emphasize the critical role of calibration in fNIRS research and argue for enhancing the reliability of deep learning-based predictions in fNIRS classification tasks. All data from our experimental process are openly available on GitHub.
♻ ☆ Simple But Effective: Rethinking the Ability of Deep Learning in fNIRS to Exclude Abnormal Input
Functional near-infrared spectroscopy (fNIRS) is a non-invasive technique for monitoring brain activity. To better understand the brain, researchers often use deep learning to address the classification challenges of fNIRS data. Our study shows that while current networks in fNIRS are highly accurate for predictions within their training distribution, they falter at identifying and excluding abnormal data which is out-of-distribution, affecting their reliability. We propose integrating metric learning and supervised methods into fNIRS research to improve networks capability in identifying and excluding out-of-distribution outliers. This method is simple yet effective. In our experiments, it significantly enhances the performance of various networks in fNIRS, particularly transformer-based one, which shows the great improvement in reliability. We will make our experiment data available on GitHub.
♻ ☆ Design of Efficient Point-Mass Filter with Application in Terrain Aided Navigation
This paper deals with state estimation of stochastic models with linear state dynamics, continuous or discrete in time. The emphasis is laid on a numerical solution to the state prediction by the time-update step of the grid-point-based point-mass filter (PMF), which is the most computationally demanding part of the PMF algorithm. A novel efficient PMF (ePMF) estimator, unifying continuous and discrete, approaches is proposed, designed, and discussed. By numerical illustrations, it is shown, that the proposed ePMF can lead to a time complexity reduction that exceeds 99.9% without compromising accuracy. The MATLAB code of the ePMF is released with this paper.
comment: Pulibshed in proccedings of FUSION 2023. PLEASE cite the published version! The code is also now available at GitHub: https://github.com/IDM-UWB/efficient-PMF
♻ ☆ Energy-Efficient Analog Beamforming for RF-WET with Charging Time Constraint
Internet of Things (IoT) sustainability may hinge on radio frequency wireless energy transfer (RF-WET). However, energy-efficient charging strategies are still needed, motivating our work. Specifically, this letter proposes a time division scheme to efficiently charge low-power devices in an IoT network. For this, a multi-antenna power beacon (PB) drives the devices' energy harvesting circuit to the highest power conversion efficiency point via energy beamforming, thus achieving minimum energy consumption. Herein, we adopt the analog multi-antenna architecture due to its low complexity, cost, and energy consumption. The proposal includes a simple yet accurate model for the transfer characteristic of the energy harvesting circuit, enabling the optimization framework. The results evince the effectiveness of our RF-WET strategy over a benchmark scheme where the PB charges all the IoT devices simultaneously. Furthermore, the performance increases with the number of PB antennas.
comment: 6 pages, 5 figures, accepted for publication in IEEE Transactions on Vehicular Technology
♻ ☆ Near-Field Beam Training: Joint Angle and Range Estimation with DFT Codebook
Prior works on near-field beam training have mostly assumed dedicated polar-domain codebook and on-grid range estimation, which, however, may suffer long training overhead and degraded estimation accuracy. To address these issues, we propose in this paper new and efficient beam training schemes with off-grid range estimation by using conventional discrete Fourier transform (DFT) codebook. Specifically, we first analyze the received beam pattern at the user when far-field beamforming vectors are used for beam scanning, and show an interesting result that this beam pattern contains useful user angle and range information. Then, we propose two efficient schemes to jointly estimate the user angle and range with the DFT codebook. The first scheme estimates the user angle based on a defined angular support and resolves the user range by leveraging an approximated angular support width, while the second scheme estimates the user range by minimizing a power ratio mean square error (MSE) to improve the range estimation accuracy. Finally, numerical simulations show that our proposed schemes greatly reduce the near-field beam training overhead and improve the range estimation accuracy as compared to various benchmark schemes.
comment: This work is submitted to IEEE for possible publication. Our other works on near-field beam training include: 1) two-phase near-field beam training (arXiv:2209.14798), 2) hierarchical near-field beam training (arXiv:2302.12511), and 3) near-field beam management (arXiv:2306.16206)
♻ ☆ SDOA-Net: An Efficient Deep Learning-Based DOA Estimation Network for Imperfect Array
The estimation of direction of arrival (DOA) is a crucial issue in conventional radar, wireless communication, and integrated sensing and communication (ISAC) systems. However, low-cost systems often suffer from imperfect factors, such as antenna position perturbations, mutual coupling effect, inconsistent gains/phases, and non-linear amplifier effect, which can significantly degrade the performance of DOA estimation. This paper proposes a DOA estimation method named super-resolution DOA network (SDOA-Net) based on deep learning (DL) to characterize the realistic array more accurately. Unlike existing DL-based DOA methods, SDOA-Net uses sampled received signals instead of covariance matrices as input to extract data features. Furthermore, SDOA-Net produces a vector that is independent of the DOA of the targets but can be used to estimate their spatial spectrum. Consequently, the same training network can be applied to any number of targets, reducing the complexity of implementation. The proposed SDOA-Net with a low-dimension network structure also converges faster than existing DL-based methods. The simulation results demonstrate that SDOA-Net outperforms existing DOA estimation methods for imperfect arrays. The SDOA-Net code is available online at https://github.com/chenpengseu/SDOA-Net.git.
comment: 11 pages, 15 figures
♻ ☆ MusicHiFi: Fast High-Fidelity Stereo Vocoding
Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.
♻ ☆ A New Heterogeneous Hybrid Massive MIMO Receiver with An Intrinsic Ability of Removing Phase Ambiguity of DOA Estimation via Machine Learning
Massive multiple input multiple output (MIMO) antenna arrays eventuate a huge amount of circuit costs and computational complexity. To satisfy the needs of high precision and low cost in future green wireless communication, the conventional Hybrid analog and digital MIMO receive structure emerges a natural choice. But it exists an issue of the phase ambiguity in direction of arrival (DOA) estimation and requires at least two time-slots to complete one-time DOA measurement with the first time-slot generating the set of candidate solutions and the remaining ones to find a true direction by received beamforming over this set. This will lead to a low time-efficiency. To address this problem, a new heterogeneous sub-connected hybrid analog and digital (HAD) MIMO structure is proposed with an intrinsic ability of removing phase ambiguity and a corresponding new framework is developed to implement a rapid high-precision DOA estimation using only single time-slot. This framework consists of two steps: 1) form a set of candidate solutions using existing methods like MUSIC) find the class of the true solutions and compute the class mean. To infer the set of true solutions, we propose two new clustering methods: weight global minimum distance (WGMD) and weight local minimum distance (WLMD). And, we also enhance two classic clustering methods: accelerating local weighted k-means (ALW-K-means) and improved density. Additionally, the corresponding closed-form expression of Cramer-Rao lower bound (CRLB) is derived. Simulation results show that the proposed frameworks using the above four clustering can approach the CRLB at almost all SNR regions except for extremely low SNR. Four clustering methods have an accuracy decreasing order as follows: WGMD, improved DBSCAN, ALW-K-means and WLMD.
♻ ☆ Cross-Domain Dual-Functional OFDM Waveform Design for Accurate Sensing/Positioning
Orthogonal frequency division multiplexing (OFDM) has been widely recognized as the representative waveform for 5G wireless networks, which can directly support sensing/positioning with existing infrastructure. To guarantee superior sensing/positioning accuracy while supporting high-speed communications simultaneously, the dual functions tend to be assigned with different resource elements (REs) due to their diverse design requirements. This motivates optimization of resource allocation/waveform design across time, frequency, power and delay-Doppler domains. Therefore, this article proposes two cross-domain waveform optimization strategies for effective convergence of OFDM-based communications and sensing/positioning, following communication- and sensing-centric criteria, respectively. For the communication-centric design, to maximize the achievable data rate, a fraction of REs are optimally allocated for communications according to prior knowledge of the communication channel. The remaining REs are then employed for sensing/positioning, where the sidelobe level and peak-to-average power ratio are suppressed by optimizing its power-frequency and phase-frequency characteristics for sensing performance improvement. For the sensing-centric design, a `locally' perfect auto-correlation property is ensured for accurate sensing and positioning by adjusting the unit cells of the ambiguity function within its region of interest (RoI). Afterwards, the irrelevant cells beyond RoI, which can readily determine the sensing power allocation, are optimized with the communication power allocation to enhance the achievable data rate. Numerical results demonstrate the superiority of the proposed waveform designs.
♻ ☆ OmniCount: Multi-label Object Counting with Semantic-Geometric Priors
Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficiencies. This paper introduces a new, more practical approach enabling simultaneous counting of multiple object categories using an open vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging point prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation in OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions and heralding a new era in object counting technology.
♻ ☆ Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data
Many modern causal questions ask how treatments affect complex outcomes that are measured using wearable devices and sensors. Current analysis approaches require summarizing these data into scalar statistics (e.g., the mean), but these summaries can be misleading. For example, disparate distributions can have the same means, variances, and other statistics. Researchers can overcome the loss of information by instead representing the data as distributions. We develop an interpretable method for distributional data analysis that ensures trustworthy and robust decision-making: Analyzing Distributional Data via Matching After Learning to Stretch (ADD MALTS). We (i) provide analytical guarantees of the correctness of our estimation strategy, (ii) demonstrate via simulation that ADD MALTS outperforms other distributional data analysis methods at estimating treatment effects, and (iii) illustrate ADD MALTS' ability to verify whether there is enough cohesion between treatment and control units within subpopulations to trustworthily estimate treatment effects. We demonstrate ADD MALTS' utility by studying the effectiveness of continuous glucose monitors in mitigating diabetes risks.
Sound
☆ UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge
We present UTDUSS, the UTokyo-SaruLab system submitted to Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech unit learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates neural audio codec (NAC) pre-trained on only speech corpora, which makes the learned codec represent rich acoustic features that are necessary for high-fidelity speech reconstruction. For the acoustic+vocoder track, we trained an acoustic model based on Transformer encoder-decoder that predicted the pre-trained NAC tokens from text input. We describe our strategies to build these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked in second and first for the Vocoder and Acoustic+Vocoder tracks, respectively.
comment: 5 pages, 3 figures
☆ Recursive Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition
Multi-modal emotion recognition has recently gained a lot of attention since it can leverage diverse and complementary relationships over multiple modalities, such as audio, visual, and text. Most state-of-the-art methods for multimodal fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of the modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial, vocal, and text modalities extracted from videos. Specifically, we propose a recursive cross-modal attention (RCMA) to effectively capture the complementary relationships across the modalities in a recursive fashion. The proposed model is able to effectively capture the inter-modal relationships by computing the cross-attention weights across the individual modalities and the joint representation of the other two modalities. To further improve the inter-modal relationships, the obtained attended features of the individual modalities are again fed as input to the cross-modal attention to refine the feature representations of the individual modalities. In addition to that, we have used Temporal convolution networks (TCNs) to capture the temporal modeling (intra-modal relationships) of the individual modalities. By deploying the TCNs as well cross-modal attention in a recursive fashion, we are able to effectively capture both intra- and inter-modal relationships across the audio, visual, and text modalities. Experimental results on validation-set videos from the AffWild2 dataset indicate that our proposed fusion model is able to achieve significant improvement over the baseline for the sixth challenge of Affective Behavior Analysis in-the-Wild 2024 (ABAW6) competition.
comment: arXiv admin note: substantial text overlap with arXiv:2209.09068; text overlap with arXiv:2203.14779 by other authors
☆ Vibration Sensitivity of one-port and two-port MEMS microphones
Micro-electro-mechanical system (MEMS) microphones (mics) with two acoustic ports are currently receiving considerable interest, with the promise of achieving higher directional sensitivity compared to traditional one-port architectures. However, measuring pressure differences in two-port microphones typically commands sensing elements that are softer than in one-port mics, and are therefore presumably more prone to interference from external vibration. Here we derive a universal expression for microphone sensitivity to vibration and we experimentally demonstrate its validity for several emerging two-port microphone technologies. We also perform vibration measurements on a one-port mic, thus providing a one-stop direct comparison between one-port and two-port sensing approaches. We find that the acoustically-referred vibration sensitivity of two-port MEMS mics, in units of measured acoustic pressure per external acceleration (i.e., Pascals per g), does not depend on the sensing element stiffness nor on its natural frequency. We also show that this vibration sensitivity in two-port mics is inversely proportional to frequency as opposed to the frequency independent behavior observed in one-port mics. This is confirmed experimentally for several types of microphone packages.
comment: 8 pages, 14 figures
☆ Advanced Long-Content Speech Recognition With Factorized Neural Transducer
In this paper, we propose two novel approaches, which integrate long-content information into the factorized neural transducer (FNT) based architecture in both non-streaming (referred to as LongFNT ) and streaming (referred to as SLongFNT ) scenarios. We first investigate whether long-content transcriptions can improve the vanilla conformer transducer (C-T) models. Our experiments indicate that the vanilla C-T models do not exhibit improved performance when utilizing long-content transcriptions, possibly due to the predictor network of C-T models not functioning as a pure language model. Instead, FNT shows its potential in utilizing long-content information, where we propose the LongFNT model and explore the impact of long-content information in both text (LongFNT-Text) and speech (LongFNT-Speech). The proposed LongFNT-Text and LongFNT-Speech models further complement each other to achieve better performance, with transcription history proving more valuable to the model. The effectiveness of our LongFNT approach is evaluated on LibriSpeech and GigaSpeech corpora, and obtains relative 19% and 12% word error rate reduction, respectively. Furthermore, we extend the LongFNT model to the streaming scenario, which is named SLongFNT , consisting of SLongFNT-Text and SLongFNT-Speech approaches to utilize long-content text and speech information. Experiments show that the proposed SLongFNT model achieves relative 26% and 17% WER reduction on LibriSpeech and GigaSpeech respectively while keeping a good latency, compared to the FNT baseline. Overall, our proposed LongFNT and SLongFNT highlight the significance of considering long-content speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems.
comment: Accepted by TASLP 2024
☆ KunquDB: An Attempt for Speaker Verification in the Chinese Opera Scenario
This work aims to promote Chinese opera research in both musical and speech domains, with a primary focus on overcoming the data limitations. We introduce KunquDB, a relatively large-scale, well-annotated audio-visual dataset comprising 339 speakers and 128 hours of content. Originating from the Kunqu Opera Art Canon (Kunqu yishu dadian), KunquDB is meticulously structured by dialogue lines, providing explicit annotations including character names, speaker names, gender information, vocal manner classifications, and accompanied by preliminary text transcriptions. KunquDB provides a versatile foundation for role-centric acoustic studies and advancements in speech-related research, including Automatic Speaker Verification (ASV). Beyond enriching opera research, this dataset bridges the gap between artistic expression and technological innovation. Pioneering the exploration of ASV in Chinese opera, we construct four test trials considering two distinct vocal manners in opera voices: stage speech (ST) and singing (S). Implementing domain adaptation methods effectively mitigates domain mismatches induced by these vocal manner variations while there is still room for further improvement as a benchmark.
☆ Building speech corpus with diverse voice characteristics for its prompt-based representation
In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging thriving text prompt-based generation techniques, it should be possible to enhance the nuanced control of voice characteristics. While previous research has explored the prompt-based manipulation of voice characteristics, most studies have used pre-recorded speech, which limits the diversity of voice characteristics available. Thus, we aim to address this gap by creating a novel corpus and developing a model for prompt-based manipulation of voice characteristics in text-to-speech synthesis, facilitating a broader range of voice characteristics. Specifically, we propose a method to build a sizable corpus pairing voice characteristics descriptions with corresponding speech samples. This involves automatically gathering voice-related speech data from the Internet, ensuring its quality, and manually annotating it using crowdsourcing. We implement this method with Japanese language data and analyze the results to validate its effectiveness. Subsequently, we propose a construction method of the model to retrieve speech from voice characteristics descriptions based on a contrastive learning method. We train the model using not only conservative contrastive learning but also feature prediction learning to predict quantitative speech features corresponding to voice characteristics. We evaluate the model performance via experiments with the corpus we constructed above.
comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. arXiv admin note: text overlap with arXiv:2309.13509
☆ TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer ICASSP2024
Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT-KWS, which leverages token-and-duration Transducers (TDT) for KWS tasks. We also propose a novel KWS task-specific decoding algorithm for Transducer-based models, which supports highly effective frame-asynchronous keyword search in streaming speech scenarios. With evaluations conducted on both the public Hey Snips and self-constructed LibriKWS-20 datasets, our proposed KWS-decoding algorithm produces more accurate results than conventional ASR decoding algorithms. Additionally, TDT-KWS achieves on-par or better wake word detection performance than both RNN-T and traditional TDT-ASR systems while achieving significant inference speed-up. Furthermore, experiments show that TDT-KWS is more robust to noisy environments compared to RNN-T KWS.
comment: Accepted by ICASSP2024
☆ Onset and offset weighted loss function for sound event detection
In a typical sound event detection (SED) system, the existence of a sound event is detected at a frame level, and consecutive frames with the same event detected are combined as one sound event. The median filter is applied as a post-processing step to remove detection errors as much as possible. However, detection errors occurring around the onset and offset of a sound event are beyond the capacity of the median filter. To address this issue, an onset and offset weighted binary cross-entropy (OWBCE) loss function is proposed in this paper, which trains the DNN model to be more robust on frames around (a) onsets and offsets. Experiments are carried out in the context of DCASE 2022 task 4. Results show that OWBCE outperforms BCE when different models are considered. For a basic CRNN, relative improvements of 6.43% in event-F1, 1.96% in PSDS1, and 2.43% in PSDS2 can be achieved by OWBCE.
☆ Frequency-aware convolution for sound event detection
In sound event detection (SED), convolution neural networks (CNNs) are widely used to extract time-frequency patterns from the input spectrogram. However, features extracted by CNN can be insensitive to the shift of time-frequency patterns along the frequency axis. To address this issue, frequency dynamic convolution (FDY) has been proposed, which applies different kernels to different frequency components. Compared to the vannila CNN, FDY requires several times more parameters. In this paper, a more efficient solution named frequency-aware convolution (FAC) is proposed. In FAC, frequency-positional information is encoded in a vector and added to the input spectrogram. To match the amplitude of input, the encoding vector is scaled adaptively and channel-independently. Experiments are carried out in the context of DCASE 2022 task 4, and the results demonstrate that FAC can achieve comparable performance to that of FDY with only 515 additional parameters, while FDY requires 8.02 million additional parameters. The ablation study shows that scaling the encoding vector adaptively and channel-independently is critical to the performance of FAC.
♻ ☆ Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation
This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition. We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features. Through the utilization of Temporal Convolutional Network (TCN) modules, we effectively captured the temporal and spatial correlations between these features. Subsequently, we employed a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability. Our method leverages a multimodal data fusion approach, integrating pre-trained audio and video backbones for feature extraction, followed by TCN-based spatiotemporal encoding and Transformer-based temporal information capture. Experimental results demonstrate the effectiveness of our approach, achieving competitive performance in VA estimation on the AffWild2 dataset.
comment: 8 pages,3 figures
♻ ☆ MusicHiFi: Fast High-Fidelity Stereo Vocoding
Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.
♻ ☆ Can Whisper perform speech-based in-context learning? ICASSP 2024
This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI. A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs) with only a small number of labelled speech samples without gradient descent. Language-level adaptation experiments using Chinese dialects showed that when applying SICL to isolated word ASR, consistent and considerable relative WER reductions can be achieved using Whisper models of any size on two dialects, which is on average 32.3%. A k-nearest-neighbours-based in-context example selection technique can be applied to further improve the efficiency of SICL, which can increase the average relative WER reduction to 36.4%. The findings are verified using speaker adaptation or continuous speech recognition tasks, and both achieved considerable relative WER reductions. Detailed quantitative analyses are also provided to shed light on SICL's adaptability to phonological variances and dialect-specific lexical nuances.
comment: Accepted by ICASSP 2024
♻ ☆ Unimodal Aggregation for CTC-based Speech Recognition ICASSP 2024
This paper works on non-autoregressive automatic speech recognition. A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token, and thus to learn better feature representations for text tokens. The frame-wise features and weights are both derived from an encoder. Then, the feature frames with unimodal weights are integrated and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training. Compared to the regular CTC, the proposed method learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity. Experiments on three Mandarin datasets show that UMA demonstrates superior or comparable performance to other advanced non-autoregressive methods, such as self-conditioned CTC. Moreover, by integrating self-conditioned CTC into the proposed framework, the performance can be further noticeably improved.
comment: Accepted by ICASSP 2024
♻ ☆ RoDia: A New Dataset for Romanian Dialect Identification from Speech NAACL 2024
We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
comment: Accepted at NAACL 2024
♻ ☆ Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment
Dysarthria is a disability that causes a disturbance in the human speech system and reduces the quality and intelligibility of a person's speech. Because of this effect, the normal speech processing systems can not work properly on impaired speech. This disability is usually associated with physical disabilities. Therefore, designing a system that can perform some tasks by receiving voice commands in the smart home can be a significant achievement. In this work, we introduce gammatonegram as an effective method to represent audio files with discriminative details, which is used as input for the convolutional neural network. On the other word, we convert each speech file into an image and propose image recognition system to classify speech in different scenarios. Proposed CNN is based on the transfer learning method on the pre-trained Alexnet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is located in a cascade arrangement with the two-class intelligibility assessment system, and the output of this system activates each one of the speech recognition networks. This architecture achieves an accuracy of 92.3% WRR. The source code of this paper is available.
comment: 12 pages, 8 figures. Iran J Comput Sci (2024)
Robotics
☆ Natural Language as Polices: Reasoning for Coordinate-Level Embodied Control with LLMs
We demonstrate experimental results with LLMs that address robotics action planning problems. Recently, LLMs have been applied in robotics action planning, particularly using a code generation approach that converts complex high-level instructions into mid-level policy codes. In contrast, our approach acquires text descriptions of the task and scene objects, then formulates action planning through natural language reasoning, and outputs coordinate level control commands, thus reducing the necessity for intermediate representation code as policies. Our approach is evaluated on a multi-modal prompt simulation benchmark, demonstrating that our prompt engineering experiments with natural language reasoning significantly enhance success rates compared to its absence. Furthermore, our approach illustrates the potential for natural language descriptions to transfer robotics skills from known tasks to previously unseen tasks.
comment: 8 pages, 2 figures
☆ A Convex Formulation of Frictional Contact for the Material Point Method and Rigid Bodies
In this paper, we introduce a novel convex formulation that seamlessly integrates the Material Point Method (MPM) with articulated rigid body dynamics in frictional contact scenarios. We extend the linear corotational hyperelastic model into the realm of elastoplasticity and include an efficient return mapping algorithm. This approach is particularly effective for MPM simulations involving significant deformation and topology changes, while preserving the convexity of the optimization problem. Our method ensures global convergence, enabling the use of large simulation time steps without compromising robustness. We have validated our approach through rigorous testing and performance evaluations, highlighting its superior capabilities in managing complex simulations relevant to robotics. Compared to previous MPM based robotic simulators, our method significantly improves the stability of contact resolution -- a critical factor in robot manipulation tasks. We make our method available in the open-source robotics toolkit, Drake.
☆ Certified Human Trajectory Prediction
Trajectory prediction plays an essential role in autonomous vehicles. While numerous strategies have been developed to enhance the robustness of trajectory prediction models, these methods are predominantly heuristic and do not offer guaranteed robustness against adversarial attacks and noisy observations. In this work, we propose a certification approach tailored for the task of trajectory prediction. To this end, we address the inherent challenges associated with trajectory prediction, including unbounded outputs, and mutli-modality, resulting in a model that provides guaranteed robustness. Furthermore, we integrate a denoiser into our method to further improve the performance. Through comprehensive evaluations, we demonstrate the effectiveness of the proposed technique across various baselines and using standard trajectory prediction datasets. The code will be made available online: https://s-attack.github.io/
☆ Embedding Pose Graph, Enabling 3D Foundation Model Capabilities with a Compact Representation
This paper presents the Embedding Pose Graph (EPG), an innovative method that combines the strengths of foundation models with a simple 3D representation suitable for robotics applications. Addressing the need for efficient spatial understanding in robotics, EPG provides a compact yet powerful approach by attaching foundation model features to the nodes of a pose graph. Unlike traditional methods that rely on bulky data formats like voxel grids or point clouds, EPG is lightweight and scalable. It facilitates a range of robotic tasks, including open-vocabulary querying, disambiguation, image-based querying, language-directed navigation, and re-localization in 3D environments. We showcase the effectiveness of EPG in handling these tasks, demonstrating its capacity to improve how robots interact with and navigate through complex spaces. Through both qualitative and quantitative assessments, we illustrate EPG's strong performance and its ability to outperform existing methods in re-localization. Our work introduces a crucial step forward in enabling robots to efficiently understand and operate within large-scale 3D spaces.
☆ Projection-free computation of robust controllable sets with constrained zonotopes
We study the problem of computing robust controllable sets for discrete-time linear systems with additive uncertainty. We propose a tractable and scalable approach to inner- and outer-approximate robust controllable sets using constrained zonotopes, when the additive uncertainty set is a symmetric, convex, and compact set. Our least-squares-based approach uses novel closed-form approximations of the Pontryagin difference between a constrained zonotopic minuend and a symmetric, convex, and compact subtrahend. Unlike existing approaches, our approach does not rely on convex optimization solvers, and is projection-free for ellipsoidal and zonotopic uncertainty sets. We also propose a least-squares-based approach to compute a convex, polyhedral outer-approximation to constrained zonotopes, and characterize sufficient conditions under which all these approximations are exact. We demonstrate the computational efficiency and scalability of our approach in several case studies, including the design of abort-safe rendezvous trajectories for a spacecraft in near-rectilinear halo orbit under uncertainty. Our approach can inner-approximate a 20-step robust controllable set for a 100-dimensional linear system in under 15 seconds on a standard computer.
comment: 22 pages, 6 figures
☆ Reinforcement Learning for Online Testing of Autonomous Driving Systems: a Replication and Extension Study
In a recent study, Reinforcement Learning (RL) used in combination with many-objective search, has been shown to outperform alternative techniques (random search and many-objective search) for online testing of Deep Neural Network-enabled systems. The empirical evaluation of these techniques was conducted on a state-of-the-art Autonomous Driving System (ADS). This work is a replication and extension of that empirical study. Our replication shows that RL does not outperform pure random test generation in a comparison conducted under the same settings of the original study, but with no confounding factor coming from the way collisions are measured. Our extension aims at eliminating some of the possible reasons for the poor performance of RL observed in our replication: (1) the presence of reward components providing contrasting or useless feedback to the RL agent; (2) the usage of an RL algorithm (Q-learning) which requires discretization of an intrinsically continuous state space. Results show that our new RL agent is able to converge to an effective policy that outperforms random testing. Results also highlight other possible improvements, which open to further investigations on how to best leverage RL for online ADS testing.
☆ DBA-Fusion: Tightly Integrating Deep Dense Visual Bundle Adjustment with Multiple Sensors for Large-Scale Localization and Mapping
Visual simultaneous localization and mapping (VSLAM) has broad applications, with state-of-the-art methods leveraging deep neural networks for better robustness and applicability. However, there is a lack of research in fusing these learning-based methods with multi-sensor information, which could be indispensable to push related applications to large-scale and complex scenarios. In this paper, we tightly integrate the trainable deep dense bundle adjustment (DBA) with multi-sensor information through a factor graph. In the framework, recurrent optical flow and DBA are performed among sequential images. The Hessian information derived from DBA is fed into a generic factor graph for multi-sensor fusion, which employs a sliding window and supports probabilistic marginalization. A pipeline for visual-inertial integration is firstly developed, which provides the minimum ability of metric-scale localization and mapping. Furthermore, other sensors (e.g., global navigation satellite system) are integrated for driftless and geo-referencing functionality. Extensive tests are conducted on both public datasets and self-collected datasets. The results validate the superior localization performance of our approach, which enables real-time dense mapping in large-scale environments. The code has been made open-source (https://github.com/GREAT-WHU/DBA-Fusion).
☆ What Matters for Active Texture Recognition With Vision-Based Tactile Sensors ICRA
This paper explores active sensing strategies that employ vision-based tactile sensors for robotic perception and classification of fabric textures. We formalize the active sampling problem in the context of tactile fabric recognition and provide an implementation of information-theoretic exploration strategies based on minimizing predictive entropy and variance of probabilistic models. Through ablation studies and human experiments, we investigate which components are crucial for quick and reliable texture recognition. Along with the active sampling strategies, we evaluate neural network architectures, representations of uncertainty, influence of data augmentation, and dataset variability. By evaluating our method on a previously published Active Clothing Perception Dataset and on a real robotic system, we establish that the choice of the active exploration strategy has only a minor influence on the recognition accuracy, whereas data augmentation and dropout rate play a significantly larger role. In a comparison study, while humans achieve 66.9% recognition accuracy, our best approach reaches 90.0% in under 5 touches, highlighting that vision-based tactile sensors are highly effective for fabric texture recognition.
comment: 7 pages, 9 figures, accepted at 2024 IEEE International Conference on Robotics and Automation (ICRA)
☆ Loss Regularizing Robotic Terrain Classification
Locomotion mechanics of legged robots are suitable when pacing through difficult terrains. Recognising terrains for such robots are important to fully yoke the versatility of their movements. Consequently, robotic terrain classification becomes significant to classify terrains in real time with high accuracy. The conventional classifiers suffer from overfitting problem, low accuracy problem, high variance problem, and not suitable for live dataset. On the other hand, classifying a growing dataset is difficult for convolution based terrain classification. Supervised recurrent models are also not practical for this classification. Further, the existing recurrent architectures are still evolving to improve accuracy of terrain classification based on live variable-length sensory data collected from legged robots. This paper proposes a new semi-supervised method for terrain classification of legged robots, avoiding preprocessing of long variable-length dataset. The proposed method has a stacked Long Short-Term Memory architecture, including a new loss regularization. The proposed method solves the existing problems and improves accuracy. Comparison with the existing architectures show the improvements.
comment: Preliminary draft of the work published in IEEE conference 2023
☆ DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses CVPR 2024
Determining the relative pose of an object between two images is pivotal to the success of generalizable object pose estimation. Existing approaches typically approximate the continuous pose representation with a large number of discrete pose hypotheses, which incurs a computationally expensive process of scoring each hypothesis at test time. By contrast, we present a Deep Voxel Matching Network (DVMNet) that eliminates the need for pose hypotheses and computes the relative object pose in a single pass. To this end, we map the two input RGB images, reference and query, to their respective voxelized 3D representations. We then pass the resulting voxels through a pose estimation module, where the voxels are aligned and the pose is computed in an end-to-end fashion by solving a least-squares problem. To enhance robustness, we introduce a weighted closest voxel algorithm capable of mitigating the impact of noisy voxels. We conduct extensive experiments on the CO3D, LINEMOD, and Objaverse datasets, demonstrating that our method delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods. Our code is released at: https://github.com/sailor-z/DVMNet/.
comment: Accepted by CVPR 2024
☆ Reward-Driven Automated Curriculum Learning for Interaction-Aware Self-Driving at Unsignalized Intersections
In this work, we present a reward-driven automated curriculum reinforcement learning approach for interaction-aware self-driving at unsignalized intersections, taking into account the uncertainties associated with surrounding vehicles (SVs). These uncertainties encompass the uncertainty of SVs' driving intention and also the quantity of SVs. To deal with this problem, the curriculum set is specifically designed to accommodate a progressively increasing number of SVs. By implementing an automated curriculum selection mechanism, the importance weights are rationally allocated across various curricula, thereby facilitating improved sample efficiency and training outcomes. Furthermore, the reward function is meticulously designed to guide the agent towards effective policy exploration. Thus the proposed framework could proactively address the above uncertainties at unsignalized intersections by employing the automated curriculum learning technique that progressively increases task difficulty, and this ensures safe self-driving through effective interaction with SVs. Comparative experiments are conducted in $Highway\_Env$, and the results indicate that our approach achieves the highest task success rate, attains strong robustness to initialization parameters of the curriculum selection module, and exhibits superior adaptability to diverse situational configurations at unsignalized intersections. Furthermore, the effectiveness of the proposed method is validated using the high-fidelity CARLA simulator.
comment: 8 pages, 6 figures
☆ LaCE-LHMP: Airflow Modelling-Inspired Long-Term Human Motion Prediction By Enhancing Laminar Characteristics in Human Flow ICRA
Long-term human motion prediction (LHMP) is essential for safely operating autonomous robots and vehicles in populated environments. It is fundamental for various applications, including motion planning, tracking, human-robot interaction and safety monitoring. However, accurate prediction of human trajectories is challenging due to complex factors, including, for example, social norms and environmental conditions. The influence of such factors can be captured through Maps of Dynamics (MoDs), which encode spatial motion patterns learned from (possibly scattered and partial) past observations of motion in the environment and which can be used for data-efficient, interpretable motion prediction (MoD-LHMP). To address the limitations of prior work, especially regarding accuracy and sensitivity to anomalies in long-term prediction, we propose the Laminar Component Enhanced LHMP approach (LaCE-LHMP). Our approach is inspired by data-driven airflow modelling, which estimates laminar and turbulent flow components and uses predominantly the laminar components to make flow predictions. Based on the hypothesis that human trajectory patterns also manifest laminar flow (that represents predictable motion) and turbulent flow components (that reflect more unpredictable and arbitrary motion), LaCE-LHMP extracts the laminar patterns in human dynamics and uses them for human motion prediction. We demonstrate the superior prediction performance of LaCE-LHMP through benchmark comparisons with state-of-the-art LHMP methods, offering an unconventional perspective and a more intuitive understanding of human movement patterns.
comment: Accepted to the 2024 IEEE International Conference on Robotics and Automation (ICRA)
☆ From One to Many: How Active Robot Swarm Sizes Influence Human Cognitive Processes
In robotics, understanding human interaction with autonomous systems is crucial for enhancing collaborative technologies. We focus on human-swarm interaction (HSI), exploring how differently sized groups of active robots affect operators' cognitive and perceptual reactions over different durations. We analyze the impact of different numbers of active robots within a 15-robot swarm on operators' time perception, emotional state, flow experience, and task difficulty perception. Our findings indicate that managing multiple active robots when compared to one active robot significantly alters time perception and flow experience, leading to a faster passage of time and increased flow. More active robots and extended durations cause increased emotional arousal and perceived task difficulty, highlighting the interaction between robot the number of active robots and human cognitive processes. These insights inform the creation of intuitive human-swarm interfaces and aid in developing swarm robotic systems aligned with human cognitive structures, enhancing human-robot collaboration.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Motion Generation from Fine-grained Textual Descriptions
The task of text2motion is to generate motion sequences from given textual descriptions, where a model should explore the interactions between natural language instructions and human body movements. While most existing works are confined to coarse-grained motion descriptions (e.g., "A man squats."), fine-grained ones specifying movements of relevant body parts are barely explored. Models trained with coarse texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure in generating motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset with fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with delicate prompts. Accordingly, we design a new text2motion model, FineMotionDiffuse, which makes full use of fine-grained textual information. Our experiments show that FineMotionDiffuse trained on FineHumanML3D acquires good results in quantitative evaluation. We also find this model can better generate spatially/chronologically composite motions by learning the implicit mappings from simple descriptions to the corresponding basic motions.
☆ Iterative Active-Inactive Obstacle Classification for Time-Optimal Collision Avoidance IROS24
Time-optimal obstacle avoidance is a prevalent problem encountered in various fields, including robotics and autonomous vehicles, where the task involves determining a path for a moving vehicle to reach its goal while navigating around obstacles within its environment. This problem becomes increasingly challenging as the number of obstacles in the environment rises. We propose an iterative active-inactive obstacle approach, which involves identifying a subset of the obstacles as "active", that considers solely the effect of the "active" obstacles on the path of the moving vehicle. The remaining obstacles are considered "inactive" and are not considered in the path planning process. The obstacles are classified as 'active' on the basis of previous findings derived from prior iterations. This approach allows for a more efficient calculation of the optimal path by reducing the number of obstacles that need to be considered. The effectiveness of the proposed method is demonstrated with two different dynamic models using the various number of obstacles. The results show that the proposed method is able to find the optimal path in a timely manner, while also being able to handle a large number of obstacles in the environment and the constraints on the motion of the object.
comment: This paper is under review in IROS24
☆ CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models
This paper introduces CLIPSwarm, a new algorithm designed to automate the modeling of swarm drone formations based on natural language. The algorithm begins by enriching a provided word, to compose a text prompt that serves as input to an iterative approach to find the formation that best matches the provided word. The algorithm iteratively refines formations of robots to align with the textual description, employing different steps for "exploration" and "exploitation". Our framework is currently evaluated on simple formation targets, limited to contour shapes. A formation is visually represented through alpha-shape contours and the most representative color is automatically found for the input word. To measure the similarity between the description and the visual representation of the formation, we use CLIP [1], encoding text and images into vectors and assessing their similarity. Subsequently, the algorithm rearranges the formation to visually represent the word more effectively, within the given constraints of available drones. Control actions are then assigned to the drones, ensuring robotic behavior and collision-free movement. Experimental results demonstrate the system's efficacy in accurately modeling robot formations from natural language descriptions. The algorithm's versatility is showcased through the execution of drone shows in photorealistic simulation with varying shapes. We refer the reader to the supplementary video for a visual reference of the results.
☆ FACT: Fast and Active Coordinate Initialization for Vision-based Drone Swarms
Swarm robots have sparked remarkable developments across a range of fields. While it is necessary for various applications in swarm robots, a fast and robust coordinate initialization in vision-based drone swarms remains elusive. To this end, our paper proposes a complete system to recover a swarm's initial relative pose on platforms with size, weight, and power (SWaP) constraints. To overcome limited coverage of field-of-view (FoV), the drones rotate in place to obtain observations. To tackle the anonymous measurements, we formulate a non-convex rotation estimation problem and transform it into a semi-definite programming (SDP) problem, which can steadily obtain global optimal values. Then we utilize the Hungarian algorithm to recover relative translation and correspondences between observations and drone identities. To safely acquire complete observations, we actively search for positions and generate feasible trajectories to avoid collisions. To validate the practicability of our system, we conduct experiments on a vision-based drone swarm with only stereo cameras and inertial measurement units (IMUs) as sensors. The results demonstrate that the system can robustly get accurate relative poses in real time with limited onboard computation resources. The source code is released.
☆ Mobile Robot Localization: a Modular, Odometry-Improving Approach
Despite the number of works published in recent years, vehicle localization remains an open, challenging problem. While map-based localization and SLAM algorithms are getting better and better, they remain a single point of failure in typical localization pipelines. This paper proposes a modular localization architecture that fuses sensor measurements with the outputs of off-the-shelf localization algorithms. The fusion filter estimates model uncertainties to improve odometry in case absolute pose measurements are lost entirely. The architecture is validated experimentally on a real robot navigating autonomously proving a reduction of the position error of more than 90% with respect to the odometrical estimate without uncertainty estimation in a two-minute navigation period without position measurements.
comment: Accepted at IEEE European Control Conference 2024
☆ Fast-Poly: A Fast Polyhedral Framework For 3D Multi-Object Tracking
3D Multi-Object Tracking (MOT) captures stable and comprehensive motion states of surrounding obstacles, essential for robotic perception. However, current 3D trackers face issues with accuracy and latency consistency. In this paper, we propose Fast-Poly, a fast and effective filter-based method for 3D MOT. Building upon our previous work Poly-MOT, Fast-Poly addresses object rotational anisotropy in 3D space, enhances local computation densification, and leverages parallelization technique, improving inference speed and precision. Fast-Poly is extensively tested on two large-scale tracking benchmarks with Python implementation. On the nuScenes dataset, Fast-Poly achieves new state-of-the-art performance with 75.8% AMOTA among all methods and can run at 34.2 FPS on a personal CPU. On the Waymo dataset, Fast-Poly exhibits competitive accuracy with 63.6% MOTA and impressive inference speed (35.5 FPS). The source code is publicly available at https://github.com/lixiaoyu2000/FastPoly.
comment: 1st on the NuScenes Tracking benchmark with 75.8 AMOTA and 34.2 FPS
☆ Automatic Navigation Map Generation for Mobile Robots in Urban Environments
A fundamental prerequisite for safe and efficient navigation of mobile robots is the availability of reliable navigation maps upon which trajectories can be planned. With the increasing industrial interest in mobile robotics, especially in urban environments, the process of generating navigation maps has become of particular interest, being a labor intensive step of the deployment process. Automating this step is challenging and becomes even more arduous when the perception capabilities are limited by cost considerations. This paper proposes an algorithm to automatically generate navigation maps using a typical navigation-oriented sensor setup: a single top-mounted 3D LiDAR sensor. The proposed method is designed and validated with the urban environment as the main use case: it is shown to be able to produce accurate maps featuring different terrain types, positive obstacles of different heights as well as negative obstacles. The algorithm is applied to data collected in a typical urban environment with a wheeled inverted pendulum robot, showing its robustness against localization, perception and dynamic uncertainties. The generated map is validated against a human-made map.
☆ Caching-Augmented Lifelong Multi-Agent Path Finding
Multi-Agent Path Finding (MAPF), which involves finding collision-free paths for multiple robots, is crucial in various applications. Lifelong MAPF, where targets are reassigned to agents as soon as they complete their initial objectives, offers a more accurate approximation of real-world warehouse planning. In this paper, we present a novel mechanism named Caching-Augmented Lifelong MAPF (CAL-MAPF), designed to improve the performance of Lifelong MAPF. We have developed a new map grid type called cache for temporary item storage and replacement and designed a lock mechanism for it to improve the stability of the planning solution. This cache mechanism was evaluated using various cache replacement policies and a spectrum of input task distributions. We identified three main factors significantly impacting CAL-MAPF performance through experimentation: suitable input task distribution, high cache hit rate, and smooth traffic. Overall, CAL-MAPF has demonstrated potential for performance improvements in certain task distributions, maps and agent configurations.
☆ Unifying Local and Global Multimodal Features for Place Recognition in Aliased and Low-Texture Environments ICRA
Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems. This paper presents a novel model, called UMF (standing for Unifying Local and Global Multimodal Features) that 1) leverages multi-modality by cross-attention blocks between vision and LiDAR features, and 2) includes a re-ranking stage that re-orders based on local feature matching the top-k candidates retrieved using a global representation. Our experiments, particularly on sequences captured on a planetary-analogous environment, show that UMF outperforms significantly previous baselines in those challenging aliased environments. Since our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset, for broader applicability. Code and models are available at https://github.com/DLR-RM/UMF
comment: Accepted submission to International Conference on Robotics and Automation (ICRA), 2024
☆ Centroidal State Estimation based on the Koopman Embedding for Dynamic Legged Locomotion IROS 2024
In this paper, we introduce a novel approach to centroidal state estimation, which plays a crucial role in predictive model-based control strategies for dynamic legged locomotion. Our approach uses the Koopman operator theory to transform the robot's complex nonlinear dynamics into a linear system, by employing dynamic mode decomposition and deep learning for model construction. We evaluate both models on their linearization accuracy and capability to capture both fast and slow dynamic system responses. We then select the most suitable model for estimation purposes, and integrate it within a moving horizon estimator. This estimator is formulated as a convex quadratic program, to facilitate robust, real-time centroidal state estimation. Through extensive simulation experiments on a quadruped robot executing various dynamic gaits, our data-driven framework outperforms conventional filtering techniques based on nonlinear dynamics. Our estimator addresses challenges posed by force/torque measurement noise in highly dynamic motions and accurately recovers the centroidal states, demonstrating the adaptability and effectiveness of the Koopman-based linear representation for complex locomotive behaviors. Importantly, our model based on dynamic mode decomposition, trained with two locomotion patterns (trot and jump), successfully estimates the centroidal states for a different motion (bound) without retraining.
comment: Submitted to IROS 2024
☆ ManiPose: A Comprehensive Benchmark for Pose-aware Object Manipulation in Robotics IROS 2024
Robotic manipulation in everyday scenarios, especially in unstructured environments, requires skills in pose-aware object manipulation (POM), which adapts robots' grasping and handling according to an object's 6D pose. Recognizing an object's position and orientation is crucial for effective manipulation. For example, if a mug is lying on its side, it's more effective to grasp it by the rim rather than the handle. Despite its importance, research in POM skills remains limited, because learning manipulation skills requires pose-varying simulation environments and datasets. This paper introduces ManiPose, a pioneering benchmark designed to advance the study of pose-varying manipulation tasks. ManiPose encompasses: 1) Simulation environments for POM feature tasks ranging from 6D pose-specific pick-and-place of single objects to cluttered scenes, further including interactions with articulated objects. 2) A comprehensive dataset featuring geometrically consistent and manipulation-oriented 6D pose labels for 2936 real-world scanned rigid objects and 100 articulated objects across 59 categories. 3) A baseline for POM, leveraging the inferencing abilities of LLM (e.g., ChatGPT) to analyze the relationship between 6D pose and task-specific requirements, offers enhanced pose-aware grasp prediction and motion planning capabilities. Our benchmark demonstrates notable advancements in pose estimation, pose-aware manipulation, and real-robot skill transfer, setting new standards for POM research. We will open-source the ManiPose benchmark with the final version paper, inviting the community to engage with our resources, available at our website:https://sites.google.com/view/manipose.
comment: 8 pages, 7 figures, submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
☆ GeRM: A Generalist Robotic Model with Mixture-of-experts for Quadruped Robot
Multi-task robot learning holds significant importance in tackling diverse and complex scenarios. However, current approaches are hindered by performance issues and difficulties in collecting training datasets. In this paper, we propose GeRM (Generalist Robotic Model). We utilize offline reinforcement learning to optimize data utilization strategies to learn from both demonstrations and sub-optimal data, thus surpassing the limitations of human demonstrations. Thereafter, we employ a transformer-based VLA network to process multi-modal inputs and output actions. By introducing the Mixture-of-Experts structure, GeRM allows faster inference speed with higher whole model capacity, and thus resolves the issue of limited RL parameters, enhancing model performance in multi-task learning while controlling computational costs. Through a series of experiments, we demonstrate that GeRM outperforms other methods across all tasks, while also validating its efficiency in both training and inference processes. Additionally, we uncover its potential to acquire emergent skills. Additionally, we contribute the QUARD-Auto dataset, collected automatically to support our training approach and foster advancements in multi-task quadruped robot learning. This work presents a new paradigm for reducing the cost of collecting robot data and driving progress in the multi-task learning community.
☆ MULAN-WC: Multi-Robot Localization Uncertainty-aware Active NeRF with Wireless Coordination
This paper presents MULAN-WC, a novel multi-robot 3D reconstruction framework that leverages wireless signal-based coordination between robots and Neural Radiance Fields (NeRF). Our approach addresses key challenges in multi-robot 3D reconstruction, including inter-robot pose estimation, localization uncertainty quantification, and active best-next-view selection. We introduce a method for using wireless Angle-of-Arrival (AoA) and ranging measurements to estimate relative poses between robots, as well as quantifying and incorporating the uncertainty embedded in the wireless localization of these pose estimates into the NeRF training loss to mitigate the impact of inaccurate camera poses. Furthermore, we propose an active view selection approach that accounts for robot pose uncertainty when determining the next-best views to improve the 3D reconstruction, enabling faster convergence through intelligent view selection. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our framework in theory and in practice. Leveraging wireless coordination and localization uncertainty-aware training, MULAN-WC can achieve high-quality 3d reconstruction which is close to applying the ground truth camera poses. Furthermore, the quantification of the information gain from a novel view enables consistent rendering quality improvement with incrementally captured images by commending the robot the novel view position. Our hardware experiments showcase the practicality of deploying MULAN-WC to real robotic systems.
☆ Discretizing SO(2)-Equivariant Features for Robotic Kitting
Robotic kitting has attracted considerable attention in logistics and industrial settings. However, existing kitting methods encounter challenges such as low precision and poor efficiency, limiting their widespread applications. To address these issues, we present a novel kitting framework that improves both the precision and computational efficiency of complex kitting tasks. Firstly, our approach introduces a fine-grained orientation estimation technique in the picking module, significantly enhancing orientation precision while effectively decoupling computational load from orientation granularity. This approach combines an SO(2)-equivariant network with a group discretization operation to preciously predict discrete orientation distributions. Secondly, we develop the Hand-tool Kitting Dataset (HKD) to evaluate the performance of different solutions in handling orientation-sensitive kitting tasks. This dataset comprises a diverse collection of hand tools and synthetically created kits, which reflects the complexities encountered in real-world kitting scenarios. Finally, a series of experiments are conducted to evaluate the performance of the proposed method. The results demonstrate that our approach offers remarkable precision and enhanced computational efficiency in robotic kitting tasks.
comment: 8 pages, 6 figures
☆ AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving
As an essential task in autonomous driving (AD), motion prediction aims to predict the future states of surround objects for navigation. One natural solution is to estimate the position of other agents in a step-by-step manner where each predicted time-step is conditioned on both observed time-steps and previously predicted time-steps, i.e., autoregressive prediction. Pioneering works like SocialLSTM and MFP design their decoders based on this intuition. However, almost all state-of-the-art works assume that all predicted time-steps are independent conditioned on observed time-steps, where they use a single linear layer to generate positions of all time-steps simultaneously. They dominate most motion prediction leaderboards due to the simplicity of training MLPs compared to autoregressive networks. In this paper, we introduce the GPT style next token prediction into motion forecasting. In this way, the input and output could be represented in a unified space and thus the autoregressive prediction becomes more feasible. However, different from language data which is composed of homogeneous units -words, the elements in the driving scene could have complex spatial-temporal and semantic relations. To this end, we propose to adopt three factorized attention modules with different neighbors for information aggregation and different position encoding styles to capture their relations, e.g., encoding the transformation between coordinate systems for spatial relativity while adopting RoPE for temporal relativity. Empirically, by equipping with the aforementioned tailored designs, the proposed method achieves state-of-the-art performance in the Waymo Open Motion and Waymo Interaction datasets. Notably, AMP outperforms other recent autoregressive motion prediction methods: MotionLM and StateTransformer, which demonstrates the effectiveness of the proposed designs.
☆ Robotics meets Fluid Dynamics: A Characterization of the Induced Airflow around a Quadrotor
The widespread adoption of quadrotors for diverse applications, from agriculture to public safety, necessitates an understanding of the aerodynamic disturbances they create. This paper introduces a computationally lightweight model for estimating the time-averaged magnitude of the induced flow below quadrotors in hover. Unlike related approaches that rely on expensive computational fluid dynamics (CFD) simulations or time-consuming empirical measurements, our method leverages classical theory from turbulent flows. By analyzing over 9 hours of flight data from drones of varying sizes within a large motion capture system, we show that the combined flow from all propellers of the drone is well-approximated by a turbulent jet. Through the use of a novel normalization and scaling, we have developed and experimentally validated a unified model that describes the mean velocity field of the induced flow for different drone sizes. The model accurately describes the far-field airflow in a very large volume below the drone which is difficult to simulate in CFD. Our model, which requires only the drone's mass, propeller size, and drone size for calculations, offers a practical tool for dynamic planning in multi-agent scenarios, ensuring safer operations near humans and optimizing sensor placements.
comment: 7+1 pages
☆ Workload Estimation for Unknown Tasks: A Survey of Machine Learning Under Distribution Shift
Human-robot teams involve humans and robots collaborating to achieve tasks under various environmental conditions. Successful teaming will require robots to adapt autonomously to a human teammate's internal state. An important element of such adaptation is the ability to estimate the human teammates' workload in unknown situations. Existing workload models use machine learning to model the relationships between physiological metrics and workload; however, these methods are susceptible to individual differences and are heavily influenced by other factors. These methods cannot generalize to unknown tasks, as they rely on standard machine learning approaches that assume data consists of independent and identically distributed (IID) samples. This assumption does not necessarily hold for estimating workload for new tasks. A survey of non-IID machine learning techniques is presented, where commonly used techniques are evaluated using three criteria: portability, model complexity, and adaptability. These criteria are used to argue which techniques are most applicable for estimating workload for unknown tasks in dynamic, real-time environments.
☆ Multi-Robot Connected Fermat Spiral Coverage ICAPS24
We introduce the Multi-Robot Connected Fermat Spiral (MCFS), a novel algorithmic framework for Multi-Robot Coverage Path Planning (MCPP) that adapts Connected Fermat Spiral (CFS) from the computer graphics community to multi-robot coordination for the first time. MCFS uniquely enables the orchestration of multiple robots to generate coverage paths that contour around arbitrarily shaped obstacles, a feature that is notably lacking in traditional methods. Our framework not only enhances area coverage and optimizes task performance, particularly in terms of makespan, for workspaces rich in irregular obstacles but also addresses the challenges of path continuity and curvature critical for non-holonomic robots by generating smooth paths without decomposing the workspace. MCFS solves MCPP by constructing a graph of isolines and transforming MCPP into a combinatorial optimization problem, aiming to minimize the makespan while covering all vertices. Our contributions include developing a unified CFS version for scalable and adaptable MCPP, extending it to MCPP with novel optimization techniques for cost reduction and path continuity and smoothness, and demonstrating through extensive experiments that MCFS outperforms existing MCPP methods in makespan, path curvature, coverage ratio, and overlapping ratio. Our research marks a significant step in MCPP, showcasing the fusion of computer graphics and automated planning principles to advance the capabilities of multi-robot systems in complex environments. Our code is available at https://github.com/reso1/MCFS.
comment: accepted to ICAPS24
☆ POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints
In this paper, we seek to learn a robot policy guaranteed to satisfy state constraints. To encourage constraint satisfaction, existing RL algorithms typically rely on Constrained Markov Decision Processes and discourage constraint violations through reward shaping. However, such soft constraints cannot offer verifiable safety guarantees. To address this gap, we propose POLICEd RL, a novel RL algorithm explicitly designed to enforce affine hard constraints in closed-loop with a black-box environment. Our key insight is to force the learned policy to be affine around the unsafe set and use this affine region as a repulsive buffer to prevent trajectories from violating the constraint. We prove that such policies exist and guarantee constraint satisfaction. Our proposed framework is applicable to both systems with continuous and discrete state and action spaces and is agnostic to the choice of the RL training algorithm. Our results demonstrate the capacity of POLICEd RL to enforce hard constraints in robotic tasks while significantly outperforming existing methods.
comment: 26 pages, 11 figures
☆ Map-Aware Human Pose Prediction for Robot Follow-Ahead
In the robot follow-ahead task, a mobile robot is tasked to maintain its relative position in front of a moving human actor while keeping the actor in sight. To accomplish this task, it is important that the robot understand the full 3D pose of the human (since the head orientation can be different than the torso) and predict future human poses so as to plan accordingly. This prediction task is especially tricky in a complex environment with junctions and multiple corridors. In this work, we address the problem of forecasting the full 3D trajectory of a human in such environments. Our main insight is to show that one can first predict the 2D trajectory and then estimate the full 3D trajectory by conditioning the estimator on the predicted 2D trajectory. With this approach, we achieve results comparable or better than the state-of-the-art methods three times faster. As part of our contribution, we present a new dataset where, in contrast to existing datasets, the human motion is in a much larger area than a single room. We also present a complete robot system that integrates our human pose forecasting network on the mobile robot to enable real-time robot follow-ahead and present results from real-world experiments in multiple buildings on campus. Our project page, including supplementary material and videos, can be found at: https://qingyuan-jiang.github.io/iros2024_poseForecasting/
☆ Look Before You Leap: Socially Acceptable High-Speed Ground Robot Navigation in Crowded Hallways IROS 2024
To operate safely and efficiently, autonomous warehouse/delivery robots must be able to accomplish tasks while navigating in dynamic environments and handling the large uncertainties associated with the motions/behaviors of other robots and/or humans. A key scenario in such environments is the hallway problem, where robots must operate in the same narrow corridor as human traffic going in one or both directions. Traditionally, robot planners have tended to focus on socially acceptable behavior in the hallway scenario at the expense of performance. This paper proposes a planner that aims to address the consequent "robot freezing problem" in hallways by allowing for "peek-and-pass" maneuvers. We then go on to demonstrate in simulation how this planner improves robot time to goal without violating social norms. Finally, we show initial hardware demonstrations of this planner in the real world.
comment: Submitted to IROS 2024
☆ Waypoint-Based Reinforcement Learning for Robot Manipulation Tasks
Robot arms should be able to learn new tasks. One framework here is reinforcement learning, where the robot is given a reward function that encodes the task, and the robot autonomously learns actions to maximize its reward. Existing approaches to reinforcement learning often frame this problem as a Markov decision process, and learn a policy (or a hierarchy of policies) to complete the task. These policies reason over hundreds of fine-grained actions that the robot arm needs to take: e.g., moving slightly to the right or rotating the end-effector a few degrees. But the manipulation tasks that we want robots to perform can often be broken down into a small number of high-level motions: e.g., reaching an object or turning a handle. In this paper we therefore propose a waypoint-based approach for model-free reinforcement learning. Instead of learning a low-level policy, the robot now learns a trajectory of waypoints, and then interpolates between those waypoints using existing controllers. Our key novelty is framing this waypoint-based setting as a sequence of multi-armed bandits: each bandit problem corresponds to one waypoint along the robot's motion. We theoretically show that an ideal solution to this reformulation has lower regret bounds than standard frameworks. We also introduce an approximate posterior sampling solution that builds the robot's motion one waypoint at a time. Results across benchmark simulations and two real-world experiments suggest that this proposed approach learns new tasks more quickly than state-of-the-art baselines. See videos here: https://youtu.be/MMEd-lYfq4Y
☆ UNO Push: Unified Nonprehensile Object Pushing via Non-Parametric Estimation and Model Predictive Control
Nonprehensile manipulation through precise pushing is an essential skill that has been commonly challenged by perception and physical uncertainties, such as those associated with contacts, object geometries, and physical properties. For this, we propose a unified framework that jointly addresses system modeling, action generation, and control. While most existing approaches either heavily rely on a priori system information for analytic modeling, or leverage a large dataset to learn dynamic models, our framework approximates a system transition function via non-parametric learning only using a small number of exploratory actions (ca. 10). The approximated function is then integrated with model predictive control to provide precise pushing manipulation. Furthermore, we show that the approximated system transition functions can be robustly transferred across novel objects while being online updated to continuously improve the manipulation accuracy. Through extensive experiments on a real robot platform with a set of novel objects and comparing against a state-of-the-art baseline, we show that the proposed unified framework is a light-weight and highly effective approach to enable precise pushing manipulation all by itself. Our evaluation results illustrate that the system can robustly ensure millimeter-level precision and can straightforwardly work on any novel object.
☆ Enhancing Security in Multi-Robot Systems through Co-Observation Planning, Reachability Analysis, and Network Flow
This paper addresses security challenges in multi-robot systems (MRS) where adversaries may compromise robot control, risking unauthorized access to forbidden areas. We propose a novel multi-robot optimal planning algorithm that integrates mutual observations and introduces reachability constraints for enhanced security. This ensures that, even with adversarial movements, compromised robots cannot breach forbidden regions without missing scheduled co-observations. The reachability constraint uses ellipsoidal over-approximation for efficient intersection checking and gradient computation. To enhance system resilience and tackle feasibility challenges, we also introduce sub-teams. These cohesive units replace individual robot assignments along each route, enabling redundant robots to deviate for co-observations across different trajectories, securing multiple sub-teams without requiring modifications. We formulate the cross-trajectory co-observation plan by solving a network flow coverage problem on the checkpoint graph generated from the original unsecured MRS trajectories, providing the same security guarantees against plan-deviation attacks. We demonstrate the effectiveness and robustness of our proposed algorithm, which significantly strengthens the security of multi-robot systems in the face of adversarial threats.
comment: 12 pages, 6 figures, submitted to IEEE Transactions on Control of Network Systems
☆ A Rule-Compliance Path Planner for Lane-Merge Scenarios Based on Responsibility-Sensitive Safety IROS 2024
Lane merging is one of the critical tasks for self-driving cars, and how to perform lane-merge maneuvers effectively and safely has become one of the important standards in measuring the capability of autonomous driving systems. However, due to the ambiguity in driving intentions and right-of-way issues, the lane merging process in autonomous driving remains deficient in terms of maintaining or ceding the right-of-way and attributing liability, which could result in protracted durations for merging and problems such as trajectory oscillation. Hence, we present a rule-compliance path planner (RCPP) for lane-merge scenarios, which initially employs the extended responsibility-sensitive safety (RSS) to elucidate the right-of-way, followed by the potential field-based sigmoid planner for path generation. In the simulation, we have validated the efficacy of the proposed algorithm. The algorithm demonstrated superior performance over previous approaches in aspects such as merging time (Saved 72.3%), path length (reduced 53.4%), and eliminating the trajectory oscillation.
comment: Submitted to IEEE IROS 2024
☆ Federated reinforcement learning for robot motion planning with zero-shot generalization
This paper considers the problem of learning a control policy for robot motion planning with zero-shot generalization, i.e., no data collection and policy adaptation is needed when the learned policy is deployed in new environments. We develop a federated reinforcement learning framework that enables collaborative learning of multiple learners and a central server, i.e., the Cloud, without sharing their raw data. In each iteration, each learner uploads its local control policy and the corresponding estimated normalized arrival time to the Cloud, which then computes the global optimum among the learners and broadcasts the optimal policy to the learners. Each learner then selects between its local control policy and that from the Cloud for next iteration. The proposed framework leverages on the derived zero-shot generalization guarantees on arrival time and safety. Theoretical guarantees on almost-sure convergence, almost consensus, Pareto improvement and optimality gap are also provided. Monte Carlo simulation is conducted to evaluate the proposed framework.
☆ AMCO: Adaptive Multimodal Coupling of Vision and Proprioception for Quadruped Robot Navigation in Outdoor Environments
We present AMCO, a novel navigation method for quadruped robots that adaptively combines vision-based and proprioception-based perception capabilities. Our approach uses three cost maps: general knowledge map; traversability history map; and current proprioception map; which are derived from a robot's vision and proprioception data, and couples them to obtain a coupled traversability cost map for navigation. The general knowledge map encodes terrains semantically segmented from visual sensing, and represents a terrain's typically expected traversability. The traversability history map encodes the robot's recent proprioceptive measurements on a terrain and its semantic segmentation as a cost map. Further, the robot's present proprioceptive measurement is encoded as a cost map in the current proprioception map. As the general knowledge map and traversability history map rely on semantic segmentation, we evaluate the reliability of the visual sensory data by estimating the brightness and motion blur of input RGB images and accordingly combine the three cost maps to obtain the coupled traversability cost map used for navigation. Leveraging this adaptive coupling, the robot can depend on the most reliable input modality available. Finally, we present a novel planner that selects appropriate gaits and velocities for traversing challenging outdoor environments using the coupled traversability cost map. We demonstrate AMCO's navigation performance in different real-world outdoor environments and observe 10.8%-34.9% reduction w.r.t. two stability metrics, and up to 50% improvement in terms of success rate compared to current navigation methods.
comment: 8 pages
☆ A Contact Model based on Denoising Diffusion to Learn Variable Impedance Control for Contact-rich Manipulation
In this paper, a novel approach is proposed for learning robot control in contact-rich tasks such as wiping, by developing Diffusion Contact Model (DCM). Previous methods of learning such tasks relied on impedance control with time-varying stiffness tuning by performing Bayesian optimization by trial-and-error with robots. The proposed approach aims to reduce the cost of robot operation by predicting the robot contact trajectories from the variable stiffness inputs and using neural models. However, contact dynamics are inherently highly nonlinear, and their simulation requires iterative computations such as convex optimization. Moreover, approximating such computations by using finite-layer neural models is difficult. To overcome these limitations, the proposed DCM used the denoising diffusion models that could simulate the complex dynamics via iterative computations of multi-step denoising, thus improving the prediction accuracy. Stiffness tuning experiments conducted in simulated and real environments showed that the DCM achieved comparable performance to a conventional robot-based optimization method while reducing the number of robot trials.
☆ "It's Not a Replacement:" Enabling Parent-Robot Collaboration to Support In-Home Learning Experiences of Young Children
Learning companion robots for young children are increasingly adopted in informal learning environments. Although parents play a pivotal role in their children's learning, very little is known about how parents prefer to incorporate robots into their children's learning activities. We developed prototype capabilities for a learning companion robot to deliver educational prompts and responses to parent-child pairs during reading sessions and conducted in-home user studies involving 10 families with children aged 3-5. Our data indicates that parents want to work with robots as collaborators to augment parental activities to foster children's learning, introducing the notion of parent-robot collaboration. Our findings offer an empirical understanding of the needs and challenges of parent-child interaction in informal learning scenarios and design opportunities for integrating a companion robot into these interactions. We offer insights into how robots might be designed to facilitate parent-robot collaboration, including parenting policies, collaboration patterns, and interaction paradigms.
☆ Quadcopter Team Configurable Motion Guided by a Quadruped
The paper focuses on modeling and experimental evaluation of a quadcopter team configurable coordination guided by a single quadruped robot. We consider the quadcopter team as particles of a two-dimensional deformable body and propose a two-dimensional affine transformation model for safe and collision-free configurable coordination of this heterogeneous robotic system. The proposed affine transformation is decomposed into translation, that is specified by the quadruped global position, and configurable motion of the quadcopters, which is determined by a nonsingular Jacobian matrix so that the quadcopter team can safely navigate a constrained environment while avoiding collision. We propose two methods to experimentally evaluate the proposed heterogeneous robot coordination model. The first method measures real positions of quadcopters, quadruped, and environmental objects all with respect to the global coordinate system. On the other hand, the second method measures position with respect to the local coordinate system fixed on the dog robot which in turn enables safe planning the Jacobian matrix of the quadcopter team while the world is virtually approached the robotic system.
☆ HRI Curriculum for a Liberal Arts Education
In this paper, we discuss the opportunities and challenges of teaching a human-robot interaction course at an undergraduate liberal arts college. We provide a sample syllabus adapted from a previous version of a course.
comment: Presented at the Designing an Intro to HRI Course Workshop at HRI 2024 (arXiv:2403.05588)
☆ Crowdsourcing Task Traces for Service Robotics
Demonstration is an effective end-user development paradigm for teaching robots how to perform new tasks. In this paper, we posit that demonstration is useful not only as a teaching tool, but also as a way to understand and assist end-user developers in thinking about a task at hand. As a first step toward gaining this understanding, we constructed a lightweight web interface to crowdsource step-by-step instructions of common household tasks, leveraging the imaginations and past experiences of potential end-user developers. As evidence of the utility of our interface, we deployed the interface on Amazon Mechanical Turk and collected 207 task traces that span 18 different task categories. We describe our vision for how these task traces can be operationalized as task models within end-user development tools and provide a roadmap for future work.
comment: Published in the companion proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction
☆ Visual Imitation Learning of Task-Oriented Object Grasping and Rearrangement
Task-oriented object grasping and rearrangement are critical skills for robots to accomplish different real-world manipulation tasks. However, they remain challenging due to partial observations of the objects and shape variations in categorical objects. In this paper, we propose the Multi-feature Implicit Model (MIMO), a novel object representation that encodes multiple spatial features between a point and an object in an implicit neural field. Training such a model on multiple features ensures that it embeds the object shapes consistently in different aspects, thus improving its performance in object shape reconstruction from partial observation, shape similarity measure, and modeling spatial relations between objects. Based on MIMO, we propose a framework to learn task-oriented object grasping and rearrangement from single or multiple human demonstration videos. The evaluations in simulation show that our approach outperforms the state-of-the-art methods for multi- and single-view observations. Real-world experiments demonstrate the efficacy of our approach in one- and few-shot imitation learning of manipulation tasks.
☆ Goal-Oriented End-User Programming of Robots
End-user programming (EUP) tools must balance user control with the robot's ability to plan and act autonomously. Many existing task-oriented EUP tools enforce a specific level of control, e.g., by requiring that users hand-craft detailed sequences of actions, rather than offering users the flexibility to choose the level of task detail they wish to express. We thereby created a novel EUP system, Polaris, that in contrast to most existing EUP tools, uses goal predicates as the fundamental building block of programs. Users can thereby express high-level robot objectives or lower-level checkpoints at their choosing, while an off-the-shelf task planner fills in any remaining program detail. To ensure that goal-specified programs adhere to user expectations of robot behavior, Polaris is equipped with a Plan Visualizer that exposes the planner's output to the user before runtime. In what follows, we describe our design of Polaris and its evaluation with 32 human participants. Our results support the Plan Visualizer's ability to help users craft higher-quality programs. Furthermore, there are strong associations between user perception of the robot and Plan Visualizer usage, and evidence that robot familiarity has a key role in shaping user experience.
comment: Published in the proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction
☆ Open Access NAO (OAN): a ROS2-based software framework for HRI applications with the NAO robot
This paper presents a new software framework for HRI experimentation with the sixth version of the common NAO robot produced by the United Robotics Group. Embracing the common demand of researchers for better performance and new features for NAO, the authors took advantage of the ability to run ROS2 onboard on the NAO to develop a framework independent of the APIs provided by the manufacturer. Such a system provides NAO with not only the basic skills of a humanoid robot such as walking and reproducing movements of interest but also features often used in HRI such as: speech recognition/synthesis, face and object detention, and the use of Generative Pre-trained Transformer (GPT) models for conversation. The developed code is therefore configured as a ready-to-use but also highly expandable and improvable tool thanks to the possibilities provided by the ROS community.
comment: 7 pages, 3 figures
☆ Sensory Glove-Based Surgical Robot User Interface IROS
Robotic surgery has reached a high level of maturity and has become an integral part of standard surgical care. However, existing surgeon consoles are bulky and take up valuable space in the operating room, present challenges for surgical team coordination, and their proprietary nature makes it difficult to take advantage of recent technological advances, especially in virtual and augmented reality. One potential area for further improvement is the integration of modern sensory gloves into robotic platforms, allowing surgeons to control robotic arms directly with their hand movements intuitively. We propose one such system that combines an HTC Vive tracker, a Manus Meta Prime 3 XR sensory glove, and God Vision wireless smart glasses. The system controls one arm of a da Vinci surgical robot. In addition to moving the arm, the surgeon can use fingers to control the end-effector of the surgical instrument. Hand gestures are used to implement clutching and similar functions. In particular, we introduce clutching of the instrument orientation, a functionality not available in the da Vinci system. The vibrotactile elements of the glove are used to provide feedback to the user when gesture commands are invoked. A preliminary evaluation of the system shows that it has excellent tracking accuracy and allows surgeons to efficiently perform common surgical training tasks with minimal practice with the new interface; this suggests that the interface is highly intuitive. The proposed system is inexpensive, allows rapid prototyping, and opens opportunities for further innovations in the design of surgical robot interfaces.
comment: 6 pages, 5 figures, 7 tables, submitted to International Conference on Intelligent Robots and Systems (IROS)2024
☆ Safety-Aware Perception for Autonomous Collision Avoidance in Dynamic Environments
Autonomous collision avoidance requires accurate environmental perception; however, flight systems often possess limited sensing capabilities with field-of-view (FOV) restrictions. To navigate this challenge, we present a safety-aware approach for online determination of the optimal sensor-pointing direction $\psi_\text{d}$ which utilizes control barrier functions (CBFs). First, we generate a spatial density function $\Phi$ which leverages CBF constraints to map the collision risk of all local coordinates. Then, we convolve $\Phi$ with an attitude-dependent sensor FOV quality function to produce the objective function $\Gamma$ which quantifies the total observed risk for a given pointing direction. Finally, by finding the global optimizer for $\Gamma$, we identify the value of $\psi_\text{d}$ which maximizes the perception of risk within the FOV. We incorporate $\psi_\text{d}$ into a safety-critical flight architecture and conduct a numerical analysis using multiple simulated mission profiles. Our algorithm achieves a success rate of $88-96\%$, constituting a $16-29\%$ improvement compared to the best heuristic methods. We demonstrate the functionality of our approach via a flight demonstration using the Crazyflie 2.1 micro-quadrotor. Without a priori obstacle knowledge, the quadrotor follows a dynamic flight path while simultaneously calculating and tracking $\psi_\text{d}$ to perceive and avoid two static obstacles with an average computation time of 371 $\mu$s.
☆ Augmented Reality Demonstrations for Scalable Robot Imitation Learning
Robot Imitation Learning (IL) is a widely used method for training robots to perform manipulation tasks that involve mimicking human demonstrations to acquire skills. However, its practicality has been limited due to its requirement that users be trained in operating real robot arms to provide demonstrations. This paper presents an innovative solution: an Augmented Reality (AR)-assisted framework for demonstration collection, empowering non-roboticist users to produce demonstrations for robot IL using devices like the HoloLens 2. Our framework facilitates scalable and diverse demonstration collection for real-world tasks. We validate our approach with experiments on three classical robotics tasks: reach, push, and pick-and-place. The real robot performs each task successfully while replaying demonstrations collected via AR.
♻ ☆ Ada-NAV: Adaptive Trajectory Length-Based Sample Efficient Policy Learning for Robotic Navigation
Trajectory length stands as a crucial hyperparameter within reinforcement learning (RL) algorithms, significantly contributing to the sample inefficiency in robotics applications. Motivated by the pivotal role trajectory length plays in the training process, we introduce Ada-NAV, a novel adaptive trajectory length scheme designed to enhance the training sample efficiency of RL algorithms in robotic navigation tasks. Unlike traditional approaches that treat trajectory length as a fixed hyperparameter, we propose to dynamically adjust it based on the entropy of the underlying navigation policy. Interestingly, Ada-NAV can be applied to both existing on-policy and off-policy RL methods, which we demonstrate by empirically validating its efficacy on three popular RL methods: REINFORCE, Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC). We demonstrate through simulated and real-world robotic experiments that Ada-NAV outperforms conventional methods that employ constant or randomly sampled trajectory lengths. Specifically, for a fixed sample budget, Ada-NAV achieves an 18\% increase in navigation success rate, a 20-38\% reduction in navigation path length, and a 9.32\% decrease in elevation costs. Furthermore, we showcase the versatility of Ada-NAV by integrating it with the Clearpath Husky robot, illustrating its applicability in complex outdoor environments.
comment: 11 pages, 9 figures, 2 tables
♻ ☆ uPLAM: Robust Panoptic Localization and Mapping Leveraging Perception Uncertainties
The availability of a robust map-based localization system is essential for the operation of many autonomously navigating vehicles. Since uncertainty is an inevitable part of perception, it is beneficial for the robustness of the robot to consider it in typical downstream tasks of navigation stacks. In particular localization and mapping methods, which in modern systems often employ convolutional neural networks (CNNs) for perception tasks, require proper uncertainty estimates. In this work, we present uncertainty-aware Panoptic Localization and Mapping (uPLAM), which employs pixel-wise uncertainty estimates for panoptic CNNs as a bridge to fuse modern perception with classical probabilistic localization and mapping approaches. Beyond the perception, we introduce an uncertainty-based map aggregation technique to create accurate panoptic maps, containing surface semantics and landmark instances. Moreover, we provide cell-wise map uncertainties, and present a particle filter-based localization method that employs perception uncertainties. Extensive evaluations show that our proposed incorporation of uncertainties leads to more accurate maps with reliable uncertainty estimates and improved localization accuracy. Additionally, we present the Freiburg Panoptic Driving dataset for evaluating panoptic mapping and localization methods. We make our code and dataset available at: \url{http://uplam.cs.uni-freiburg.de}
♻ ☆ Exposing the Unseen: Exposure Time Emulation for Offline Benchmarking of Vision Algorithms IROS 2024
Visual Odometry (VO) is one of the fundamental tasks in computer vision for robotics. However, its performance is deeply affected by High Dynamic Range (HDR) scenes, omnipresent outdoor. While new Automatic-Exposure (AE) approaches to mitigate this have appeared, their comparison in a reproducible manner is problematic. This stems from the fact that the behavior of AE depends on the environment, and it affects the image acquisition process. Consequently, AE has traditionally only been benchmarked in an online manner, making the experiments non-reproducible. To solve this, we propose a new methodology based on an emulator that can generate images at any exposure time. It leverages BorealHDR, a unique multi-exposure stereo dataset collected over 10 km, on 55 trajectories with challenging illumination conditions. Moreover, it includes lidar-inertial-based global maps with pose estimation for each image frame as well as Global Navigation Satellite System (GNSS) data, for comparison. We show that using these images acquired at different exposure times, we can emulate realistic images, keeping a Root-Mean-Square Error (RMSE) below 1.78 % compared to ground truth images. To demonstrate the practicality of our approach for offline benchmarking, we compared three state-of-the-art AE algorithms on key elements of Visual Simultaneous Localization And Mapping (VSLAM) pipeline, against four baselines. Consequently, reproducible evaluation of AE is now possible, speeding up the development of future approaches. Our code and dataset are available online at this link: https://github.com/norlab-ulaval/BorealHDR
comment: 8 pages, 6 figures, submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
♻ ☆ Investigation of Enhanced Inertial Navigation Algorithms by Functional Iteration
The defects of the traditional strapdown inertial navigation algorithms become well acknowledged and the corresponding enhanced algorithms have been quite recently proposed trying to mitigate both theoretical and algorithmic defects. In this paper, the analytical accuracy evaluation of both the traditional algorithms and the enhanced algorithms is investigated, against the true reference for the first time enabled by the functional iteration approach having provable convergence. The analyses by the help of MATLAB Symbolic Toolbox show that the resultant error orders of all algorithms under investigation are consistent with those in the existing literatures, and the enhanced attitude algorithm notably reduces error orders of the traditional counterpart, while the impact of the enhanced velocity algorithm on error order reduction is insignificant. Simulation results agree with analyses that the superiority of the enhanced algorithm over the traditional one in the body-frame attitude computation scenario diminishes significantly in the entire inertial navigation computation scenario, while the functional iteration approach possesses significant accuracy superiority even under sustained lowly dynamic conditions.
comment: 12 pages, 3 figs
♻ ☆ Intention-Aware Planner for Robust and Safe Aerial Tracking IROS
Autonomous target tracking with quadrotors has wide applications in many scenarios, such as cinematographic follow-up shooting or suspect chasing. Target motion prediction is necessary when designing the tracking planner. However, the widely used constant velocity or constant rotation assumption can not fully capture the dynamics of the target. The tracker may fail when the target happens to move aggressively, such as sudden turn or deceleration. In this paper, we propose an intention-aware planner by additionally considering the intention of the target to enhance safety and robustness in aerial tracking applications. Firstly, a designated intention prediction method is proposed, which combines a user-defined potential assessment function and a state observation function. A reachable region is generated to specifically evaluate the turning intentions. Then we design an intention-driven hybrid A* method to predict the future possible positions for the target. Finally, an intention-aware optimization approach is designed to generate a spatial-temporal optimal trajectory, allowing the tracker to perceive unexpected situations from the target. Benchmark comparisons and real-world experiments are conducted to validate the performance of our method.
comment: 8 pages, 10 figures, submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
♻ ☆ Surfer: Progressive Reasoning with World Models for Robotic Manipulation
Considering how to make the model accurately understand and follow natural language instructions and perform actions consistent with world knowledge is a key challenge in robot manipulation. This mainly includes human fuzzy instruction reasoning and the following of physical knowledge. Therefore, the embodied intelligence agent must have the ability to model world knowledge from training data. However, most existing vision and language robot manipulation methods mainly operate in less realistic simulator and language settings and lack explicit modeling of world knowledge. To bridge this gap, we introduce a novel and simple robot manipulation framework, called Surfer. It is based on the world model, treats robot manipulation as a state transfer of the visual scene, and decouples it into two parts: action and scene. Then, the generalization ability of the model on new instructions and new scenes is enhanced by explicit modeling of the action and scene prediction in multi-modal information. In addition to the framework, we also built a robot manipulation simulator that supports full physics execution based on the MuJoCo physics engine. It can automatically generate demonstration training data and test data, effectively reducing labor costs. To conduct a comprehensive and systematic evaluation of the robot manipulation model in terms of language understanding and physical execution, we also created a robotic manipulation benchmark with progressive reasoning tasks, called SeaWave. It contains 4 levels of progressive reasoning tasks and can provide a standardized testing platform for embedded AI agents in multi-modal environments. On average, Surfer achieved a success rate of 54.74% on the defined four levels of manipulation tasks, exceeding the best baseline performance of 47.64%.
♻ ☆ LLM3:Large Language Model-based Task and Motion Planning with Motion Failure Reasoning IROS 2024
Conventional Task and Motion Planning (TAMP) approaches rely on manually crafted interfaces connecting symbolic task planning with continuous motion generation. These domain-specific and labor-intensive modules are limited in addressing emerging tasks in real-world settings. Here, we present LLM^3, a novel Large Language Model (LLM)-based TAMP framework featuring a domain-independent interface. Specifically, we leverage the powerful reasoning and planning capabilities of pre-trained LLMs to propose symbolic action sequences and select continuous action parameters for motion planning. Crucially, LLM^3 incorporates motion planning feedback through prompting, allowing the LLM to iteratively refine its proposals by reasoning about motion failure. Consequently, LLM^3 interfaces between task planning and motion planning, alleviating the intricate design process of handling domain-specific messages between them. Through a series of simulations in a box-packing domain, we quantitatively demonstrate the effectiveness of LLM^3 in solving TAMP problems and the efficiency in selecting action parameters. Ablation studies underscore the significant contribution of motion failure reasoning to the success of LLM^3. Furthermore, we conduct qualitative experiments on a physical manipulator, demonstrating the practical applicability of our approach in real-world settings.
comment: Submitted to IROS 2024. Codes available: https://github.com/AssassinWS/LLM-TAMP
♻ ☆ Pseudo-rigid body networks: learning interpretable deformable object dynamics from partial observations
Accurate prediction of deformable linear object (DLO) dynamics is challenging if the task at hand requires a human-interpretable yet computationally fast model. In this work, we draw inspiration from the pseudo-rigid body method (PRB) and model a DLO as a serial chain of rigid bodies whose internal state is unrolled through time by a dynamics network. This dynamics network is trained jointly with a physics-informed encoder which maps observed motion variables to the DLO's hidden state. To encourage that the state acquires a physically meaningful representation, we leverage the forward kinematics of the PRB model as decoder. We demonstrate in robot experiments that the proposed DLO dynamics model provides physically interpretable predictions from partial observations while being on par with black-box models regarding prediction accuracy. The project code is available at: http://tinyurl.com/prb-networks
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation
In this paper, we introduce a novel training method for making any monocular depth network learn absolute scale and estimate metric road-scene depth just from regular training data, i.e., driving videos. We refer to this training framework as StableCamH. The key idea is to leverage cars found on the road as sources of scale supervision but to incorporate them in the training robustly. StableCamH detects and estimates the sizes of cars in the frame and aggregates scale information extracted from them into a camera height estimate whose consistency across the entire video sequence is enforced as scale supervision. This realizes robust unsupervised training of any, otherwise scale-oblivious, monocular depth network to become not only scale-aware but also metric-accurate without the need for auxiliary sensors and extra supervision. Extensive experiments on the KITTI and Cityscapes datasets show the effectiveness of StableCamH and its state-of-the-art accuracy compared with related methods. We also show that StableCamH enables training on mixed datasets of different camera heights, which leads to larger-scale training and thus higher generalization. Metric depth reconstruction is essential in any road-scene visual modeling, and StableCamH democratizes its deployment by establishing the means to train any model as a metric depth estimator.
♻ ☆ NDT-Map-Code: A 3D global descriptor for real-time loop closure detection in lidar SLAM
Loop-closure detection, also known as place recognition, aiming to identify previously visited locations, is an essential component of a SLAM system. Existing research on lidar-based loop closure heavily relies on dense point cloud and 360 FOV lidars. This paper proposes an out-of-the-box NDT (Normal Distribution Transform) based global descriptor, NDT-Map-Code, designed for both on-road driving and underground valet parking scenarios. NDT-Map-Code can be directly extracted from the NDT map without the need for a dense point cloud, resulting in excellent scalability and low maintenance cost. The NDT representation is leveraged to identify representative patterns, which are further encoded according to their spatial location (bearing, range, and height). Experimental results on the NIO underground parking lot dataset and the KITTI dataset demonstrate that our method achieves significantly better performance compared to the state-of-the-art.
comment: 8 pages, 6 figures, 4 tables
♻ ☆ Vision-State Fusion: Improving Deep Neural Networks for Autonomous Robotics
Vision-based deep learning perception fulfills a paramount role in robotics, facilitating solutions to many challenging scenarios, such as acrobatic maneuvers of autonomous unmanned aerial vehicles (UAVs) and robot-assisted high-precision surgery. Control-oriented end-to-end perception approaches, which directly output control variables for the robot, commonly take advantage of the robot's state estimation as an auxiliary input. When intermediate outputs are estimated and fed to a lower-level controller, i.e. mediated approaches, the robot's state is commonly used as an input only for egocentric tasks, which estimate physical properties of the robot itself. In this work, we propose to apply a similar approach for the first time -- to the best of our knowledge -- to non-egocentric mediated tasks, where the estimated outputs refer to an external subject. We prove how our general methodology improves the regression performance of deep convolutional neural networks (CNNs) on a broad class of non-egocentric 3D pose estimation problems, with minimal computational cost. By analyzing three highly-different use cases, spanning from grasping with a robotic arm to following a human subject with a pocket-sized UAV, our results consistently improve the R\textsuperscript{2} regression metric, up to +0.51, compared to their stateless baselines. Finally, we validate the in-field performance of a closed-loop autonomous cm-scale UAV on the human pose estimation task. Our results show a significant reduction, i.e., 24\% on average, on the mean absolute error of our stateful CNN, compared to a State-of-the-Art stateless counterpart.
comment: This paper has been accepted for publication in the Journal of Intelligent & Robotic Systems. \copyright 2024 Springer
♻ ☆ Distributed Pose-graph Optimization with Multi-level Partitioning for Collaborative SLAM
The back-end module of Distributed Collaborative Simultaneous Localization and Mapping (DCSLAM) requires solving a nonlinear Pose Graph Optimization (PGO) under a distributed setting, also known as SE(d)-synchronization. Most existing distributed graph optimization algorithms employ a simple sequential partitioning scheme, which may result in unbalanced subgraph dimensions due to the different geographic locations of each robot, and hence imposes extra communication load. Moreover, the performance of current Riemannian optimization algorithms can be further accelerated. In this letter, we propose a novel distributed pose graph optimization algorithm combining multi-level partitioning with an accelerated Riemannian optimization method. Firstly, we employ the multi-level graph partitioning algorithm to preprocess the naive pose graph to formulate a balanced optimization problem. In addition, inspired by the accelerated coordinate descent method, we devise an Improved Riemannian Block Coordinate Descent (IRBCD) algorithm and the critical point obtained is globally optimal. Finally, we evaluate the effects of four common graph partitioning approaches on the correlation of the inter-subgraphs, and discover that the Highest scheme has the best partitioning performance. Also, we implement simulations to quantitatively demonstrate that our proposed algorithm outperforms the state-of-the-art distributed pose graph optimization protocols.
♻ ☆ Results and Lessons Learned from Autonomous Driving Transportation Services in Airfield, Crowded Indoor, and Urban Environments
Autonomous vehicles have been actively investigated over the past few decades. Several recent works show the potential of autonomous vehicles in urban environments with impressive experimental results. However, these works note that autonomous vehicles are still occasionally inferior to expert drivers in complex scenarios. Furthermore, they do not focus on the possibilities of autonomous driving transportation services in other areas beyond urban environments. This paper presents the research results and lessons learned from autonomous driving transportation services in airfield, crowded indoor, and urban environments. We discuss how we address several unique challenges in these diverse environments. We also offer an overview of remaining challenges that have not received much attention but must be addressed. This paper aims to share our unique experience to support researchers who are interested in exploring autonomous driving transportation services in various real-world environments.
comment: 8 pages, 7 figures, 4 tables
♻ ☆ Improving Out-of-Distribution Generalization of Learned Dynamics by Learning Pseudometrics and Constraint Manifolds ICRA 2024
We propose a method for improving the prediction accuracy of learned robot dynamics models on out-of-distribution (OOD) states. We achieve this by leveraging two key sources of structure often present in robot dynamics: 1) sparsity, i.e., some components of the state may not affect the dynamics, and 2) physical limits on the set of possible motions, in the form of nonholonomic constraints. Crucially, we do not assume this structure is known a priori, and instead learn it from data. We use contrastive learning to obtain a distance pseudometric that uncovers the sparsity pattern in the dynamics, and use it to reduce the input space when learning the dynamics. We then learn the unknown constraint manifold by approximating the normal space of possible motions from the data, which we use to train a Gaussian process (GP) representation of the constraint manifold. We evaluate our approach on a physical differential-drive robot and a simulated quadrotor, showing improved prediction accuracy on OOD data relative to baselines.
comment: Accept to ICRA 2024, 6 pages + references
♻ ☆ LingoQA: Video Question Answering for Autonomous Driving
Autonomous driving has long faced a challenge with public acceptance due to the lack of explainability in the decision-making process. Video question-answering (QA) in natural language provides the opportunity for bridging this gap. Nonetheless, evaluating the performance of Video QA models has proved particularly tough due to the absence of comprehensive benchmarks. To fill this gap, we introduce LingoQA, a benchmark specifically for autonomous driving Video QA. The LingoQA trainable metric demonstrates a 0.95 Spearman correlation coefficient with human evaluations. We introduce a Video QA dataset of central London consisting of 419k samples that we release with the paper. We establish a baseline vision-language model and run extensive ablation studies to understand its performance.
comment: Benchmark and dataset are available at https://github.com/wayveai/LingoQA/
♻ ☆ Designing Library of Skill-Agents for Hardware-Level Reusability
To use new robot hardware in a new environment, it is necessary to develop a control program tailored to that specific robot in that environment. Considering the reusability of software among robots is crucial to minimize the effort involved in this process and maximize software reuse across different robots in different environments. This paper proposes a method to remedy this process by considering hardware-level reusability, using Learning-from-observation (LfO) paradigm with a pre-designed skill-agent library. The LfO framework represents the required actions in hardware-independent representations, referred to as task models, from observing human demonstrations, capturing the necessary parameters for the interaction between the environment and the robot. When executing the desired actions from the task models, a set of skill agents is employed to convert the representations into robot commands. This paper focuses on the latter part of the LfO framework, utilizing the set to generate robot actions from the task models, and explores a hardware-independent design approach for these skill agents. These skill agents are described in a hardware-independent manner, considering the relative relationship between the robot's hand position and the environment. As a result, it is possible to execute these actions on robots with different hardware configurations by simply swapping the inverse kinematics solver. This paper, first, defines a necessary and sufficient skill-agent set corresponding to cover all possible actions, and considers the design principles for these skill agents in the library. We provide concrete examples of such skill agents and demonstrate the practicality of using these skill agents by showing that the same representations can be executed on two different robots, Nextage and Fetch, using the proposed skill-agents set.
♻ ☆ Working Backwards: Learning to Place by Picking IROS'24
We present placing via picking (PvP), a method to autonomously collect real-world demonstrations for a family of placing tasks in which objects must be manipulated to specific contact-constrained locations. With PvP, we approach the collection of robotic object placement demonstrations by reversing the grasping process and exploiting the inherent symmetry of the pick and place problems. Specifically, we obtain placing demonstrations from a set of grasp sequences of objects initially located at their target placement locations. Our system can collect hundreds of demonstrations in contact-constrained environments without human intervention by combining two modules: tactile regrasping and compliant control for grasps. We train a policy directly from visual observations through behavioral cloning, using the autonomously-collected demonstrations. By doing so, the policy can generalize to object placement scenarios outside of the training environment without privileged information (e.g., placing a plate picked up from a table). We validate our approach in home robotic scenarios that include dishwasher loading and table setting. Our approach yields robotic placing policies that outperform policies trained with kinesthetic teaching, both in terms of performance and data efficiency, while requiring no human supervision.
comment: Submitted to the IEEE/RSJ International Conference on Intelligent Robotics and Systems (IROS'24), Abu Dhabi, UAE, Oct 14-18, 2024
♻ ☆ PhotoBot: Reference-Guided Interactive Photography via Natural Language IROS'24
We introduce PhotoBot, a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. We propose to communicate photography suggestions to the user via reference images that are selected from a curated gallery. We leverage a visual language model (VLM) and an object detector to characterize the reference images via textual descriptions and then use a large language model (LLM) to retrieve relevant reference images based on a user's language query through text-based reasoning. To correspond the reference image and the observed scene, we exploit pre-trained features from a vision transformer capable of capturing semantic similarity across marked appearance variations. Using these features, we compute pose adjustments for an RGB-D camera by solving a perspective-n-point (PnP) problem. We demonstrate our approach using a manipulator equipped with a wrist camera. Our user studies show that photos taken by PhotoBot are often more aesthetically pleasing than those taken by users themselves, as measured by human feedback. We also show that PhotoBot can generalize to other reference sources such as paintings.
comment: Submitted to the IEEE/RSJ International Conference on Intelligent Robotics and Systems (IROS'24), Abu Dhabi, UAE, Oct 14-18, 2024
♻ ☆ Social Robots for Sleep Health: A Scoping Review
Poor sleep health is an increasingly concerning public healthcare crisis, especially when coupled with a dwindling number of health professionals qualified to combat it. However, there is a growing body of scientific literature on the use of digital technologies in supporting and sustaining individuals' healthy sleep habits. Social robots are a relatively recent technology that has been used to facilitate health care interventions and may have potential in improving sleep health outcomes, as well. Social robots' unique characteristics -- such as anthropomorphic physical embodiment or effective communication methods -- help to engage users and motivate them to comply with specific interventions, thus improving the interventions' outcomes. This scoping review aims to evaluate current scientific evidence for employing social robots in sleep health interventions, identify critical research gaps, and suggest future directions for developing and using social robots to improve people's sleep health. Our analysis of the reviewed studies found them limited due to a singular focus on the older adult population, use of small sample sizes, limited intervention durations, and other compounding factors. Nevertheless, the reviewed studies reported several positive outcomes, highlighting the potential social robots hold in this field. Although our review found limited clinical evidence for the efficacy of social robots as purveyors of sleep health interventions, it did elucidate the potential for a successful future in this domain if current limitations are addressed and more research is conducted.
♻ ☆ LGMCTS: Language-Guided Monte-Carlo Tree Search for Executable Semantic Object Rearrangement
We introduce a novel approach to the executable semantic object rearrangement problem. In this challenge, a robot seeks to create an actionable plan that rearranges objects within a scene according to a pattern dictated by a natural language description. Unlike existing methods such as StructFormer and StructDiffusion, which tackle the issue in two steps by first generating poses and then leveraging a task planner for action plan formulation, our method concurrently addresses pose generation and action planning. We achieve this integration using a Language-Guided Monte-Carlo Tree Search (LGMCTS). Quantitative evaluations are provided on two simulation datasets, and complemented by qualitative tests with a real robot.
comment: Our code and supplementary materials are accessible at https://github.com/changhaonan/LG-MCTS
Systems and Control
☆ Projection-free computation of robust controllable sets with constrained zonotopes
We study the problem of computing robust controllable sets for discrete-time linear systems with additive uncertainty. We propose a tractable and scalable approach to inner- and outer-approximate robust controllable sets using constrained zonotopes, when the additive uncertainty set is a symmetric, convex, and compact set. Our least-squares-based approach uses novel closed-form approximations of the Pontryagin difference between a constrained zonotopic minuend and a symmetric, convex, and compact subtrahend. Unlike existing approaches, our approach does not rely on convex optimization solvers, and is projection-free for ellipsoidal and zonotopic uncertainty sets. We also propose a least-squares-based approach to compute a convex, polyhedral outer-approximation to constrained zonotopes, and characterize sufficient conditions under which all these approximations are exact. We demonstrate the computational efficiency and scalability of our approach in several case studies, including the design of abort-safe rendezvous trajectories for a spacecraft in near-rectilinear halo orbit under uncertainty. Our approach can inner-approximate a 20-step robust controllable set for a 100-dimensional linear system in under 15 seconds on a standard computer.
comment: 22 pages, 6 figures
☆ On Optimal Management of Energy Storage Systems in Renewable Energy Communities
Renewable energy communities are legal entities involving the association of citizens, organizations and local businesses aimed at contributing to the green energy transition and providing social, environmental and economic benefits to their members. This goal is pursued through the cooperative efforts of the community actors and by increasing the local energy self-consumption. In this paper, the optimal energy community operation in the presence of energy storage units is addressed. By exploiting the flexibility provided by the storage facilities, the main task is to minimize the community energy bill by taking advantage of incentives related to local self-consumption. Optimality conditions are derived, and an explicit optimal solution is devised. Numerical simulations are provided to assess the performance of the proposed solution.
☆ Adaptive Reconstruction of Nonlinear Systems States via DREM with Perturbation Annihilation
A new adaptive observer is proposed for a certain class of nonlinear systems with bounded unknown input and parametric uncertainty. Unlike most existing solutions, the proposed approach ensures asymptotic convergence of the unknown parameters, state and perturbation estimates to an arbitrarily small neighborhood of the equilibrium point. The solution is based on the novel augmentation of a high-gain observer with the dynamic regressor extension and mixing (DREM) procedure enhanced with a perturbation annihilation algorithm. The aforementioned properties of the proposed solution are verified via numerical experiments.
comment: 6 pages, 2 figures
☆ Macroscopic pricing schemes for the utilization of pool ride-hailing vehicles in bus lanes
With the increasing popularity of ride-hailing services, new modes of transportation are having a significant impact on the overall performance of transportation networks. As a result, there is a need to ensure that both the various transportation alternatives and the spatial network resources are used efficiently. In this work, we analyze a network configuration where part of the urban transportation network is devoted to dedicated bus lanes. Apart from buses, we let pool ride-hailing trips use the dedicated bus lanes which, contingent upon the demand for the remaining modes, may result in faster trips for users opting for the pooling alternative. Under an aggregated modelling framework, we characterize the spatial configuration and the multi-modal demand split for which this strategy achieves a system optimum. For these specific scenarios, we compute the equilibrium when ride-hailing users can choose between solo and pool services, and we provide a pricing scheme for mitigating the gap between total user delays of the system optimum and user equilibrium solutions, when needed.
comment: Extended version of paper accepted to European Control Conference (ECC) 2024
☆ Priority-based Energy Allocation in Buildings for Distributed Model Predictive Control
Many countries are facing energy shortage today and most of the global energy is consumed by HVAC systems in buildings. For the scenarios where the energy system is not sufficiently supplied to HVAC systems, a priority-based allocation scheme based on distributed model predictive control is proposed in this paper, which distributes the energy rationally based on priority order. According to the scenarios, two distributed allocation strategies, i.e., one-to-one priority strategy and multi-to-one priority strategy, are developed in this paper and validated by simulation in a building containing three zones and a building containing 36 rooms, respectively. Both strategies fully exploit the potential of predictive control solutions. The experiment shows that our scheme has good scalability and achieve the performance of centralized strategy while making the calculation tractable.
☆ 3D Directed Formation Control with Global Shape Convergence using Bispherical Coordinates
In this paper, we present a novel 3D formation control scheme for directed graphs in a leader-follower configuration, achieving (almost) global convergence to the desired shape. Specifically, we introduce three controlled variables representing bispherical coordinates that uniquely describe the formation in 3D. Acyclic triangulated directed graphs (a class of minimally acyclic persistent graphs) are used to model the inter-agent sensing topology, while the agents' dynamics are governed by single-integrator model. Our analysis demonstrates that the proposed decentralized formation controller ensures (almost) global asymptotic stability while avoiding potential shape ambiguities in the final formation. Furthermore, the control laws are implementable in arbitrarily oriented local coordinate frames of follower agents using only low-cost onboard vision sensors, making it suitable for practical applications. Finally, we validate our formation control approach by a simulation study.
comment: Submitted to ECC 2024
☆ Optimal control of continuous-time symmetric systems with unknown dynamics and noisy measurements
An iterative learning algorithm is presented for continuous-time linear-quadratic optimal control problems where the system is externally symmetric with unknown dynamics. Both finite-horizon and infinite-horizon problems are considered. It is shown that the proposed algorithm is globally convergent to the optimal solution and has some advantages over adaptive dynamic programming, including being unbiased under noisy measurements and having a relatively low computational burden. Numerical experiments show the effectiveness of the results.
☆ Bayesian Physics-informed Neural Networks for System Identification of Inverter-dominated Power Systems
While the uncertainty in generation and demand increases, accurately estimating the dynamic characteristics of power systems becomes crucial for employing the appropriate control actions to maintain their stability. In our previous work, we have shown that Bayesian Physics-informed Neural Networks (BPINNs) outperform conventional system identification methods in identifying the power system dynamic behavior under measurement noise. This paper takes the next natural step and addresses the more significant challenge, exploring how BPINN perform in estimating power system dynamics under increasing uncertainty from many Inverter-based Resources (IBRs) connected to the grid. These introduce a different type of uncertainty, compared to noisy measurements. The BPINN combines the advantages of Physics-informed Neural Networks (PINNs), such as inverse problem applicability, with Bayesian approaches for uncertainty quantification. We explore the BPINN performance on a wide range of systems, starting from a single machine infinite bus (SMIB) system and 3-bus system to extract important insights, to the 14-bus CIGRE distribution grid, and the large IEEE 118-bus system. We also investigate approaches that can accelerate the BPINN training, such as pretraining and transfer learning. Throughout this paper, we show that in presence of uncertainty, the BPINN achieves orders of magnitude lower errors than the widely popular method for system identification SINDy and significantly lower errors than PINN, while transfer learning helps reduce training time by up to 80 %.
comment: Submitted to Electric Power Systems Research
☆ Lattice piecewise affine approximation of explicit model predictive control with application to satellite attitude control
Satellite attitude cotrol is a crucial part of aerospace technology, and model predictive control(MPC) is one of the most promising controllers in this area, which will be less effective if real-time online optimization can not be achieved. Explicit MPC converts the online calculation into a table lookup process, however the solution is difficult to obtain if the system dimension is high or the constraints are complex. The lattice piecewise affine(PWA) function was used to represent the control law of explicit MPC, although the online calculation complexity is reduced, the offline calculation is still prohibitive for complex problems. In this paper, we use the sample points in the feasible region with their corresponding affine functions to construct the lattice PWA approximation of the optimal MPC controller designed for satellite attitude control. The asymptotic stability of satellite attitude control system under lattice PWA approximation has been proven, and simulations are executed to verify that the proposed method can achieve almost the same performance as linear online MPC with much lower online computational complexity and use less fuel than LQR method.
☆ Augmented Labeled Random Finite Sets and Its Application to Group Target Tracking
This paper addresses the problem of group target tracking (GTT), wherein multiple closely spaced targets within a group pose a coordinated motion. To improve the tracking performance, the labeled random finite sets (LRFSs) theory is adopted, and this paper develops a new kind of LRFSs, i.e., augmented LRFSs, which introduces group information into the definition of LRFSs. Specifically, for each element in an LRFS, the kinetic states, track label, and the corresponding group information of its represented target are incorporated. Furthermore, by means of the labeled multi-Bernoulli (LMB) filter with the proposed augmented LRFSs, the group structure is iteratively propagated and updated during the tracking process, which achieves the simultaneously estimation of the kinetic states, track label, and the corresponding group information of multiple group targets, and further improves the GTT tracking performance. Finally, simulation experiments are provided, which well demonstrates the effectiveness of the labeled multi-Bernoulli filter with the proposed augmented LRFSs for GTT tracking.
☆ Integrating Large Language Models for Severity Classification in Traffic Incident Management: A Machine Learning Approach
This study evaluates the impact of large language models on enhancing machine learning processes for managing traffic incidents. It examines the extent to which features generated by modern language models improve or match the accuracy of predictions when classifying the severity of incidents using accident reports. Multiple comparisons performed between combinations of language models and machine learning algorithms, including Gradient Boosted Decision Trees, Random Forests, and Extreme Gradient Boosting. Our research uses both conventional and language model-derived features from texts and incident reports, and their combinations to perform severity classification. Incorporating features from language models with those directly obtained from incident reports has shown to improve, or at least match, the performance of machine learning techniques in assigning severity levels to incidents, particularly when employing Random Forests and Extreme Gradient Boosting methods. This comparison was quantified using the F1-score over uniformly sampled data sets to obtain balanced severity classes. The primary contribution of this research is in the demonstration of how Large Language Models can be integrated into machine learning workflows for incident management, thereby simplifying feature extraction from unstructured text and enhancing or matching the precision of severity predictions using conventional machine learning pipeline. The engineering application of this research is illustrated through the effective use of these language processing models to refine the modelling process for incident severity classification. This work provides significant insights into the application of language processing capabilities in combination with traditional data for improving machine learning pipelines in the context of classifying incident severity.
☆ Adversarial Attacks and Defenses in Automated Control Systems: A Comprehensive Benchmark
Integrating machine learning into Automated Control Systems (ACS) enhances decision-making in industrial process management. One of the limitations to the widespread adoption of these technologies in industry is the vulnerability of neural networks to adversarial attacks. This study explores the threats in deploying deep learning models for fault diagnosis in ACS using the Tennessee Eastman Process dataset. By evaluating three neural networks with different architectures, we subject them to six types of adversarial attacks and explore five different defense methods. Our results highlight the strong vulnerability of models to adversarial samples and the varying effectiveness of defense strategies. We also propose a novel protection approach by combining multiple defense methods and demonstrate it's efficacy. This research contributes several insights into securing machine learning within ACS, ensuring robust fault diagnosis in industrial processes.
☆ Distributed Cooperative Formation Control of Nonlinear Multi-Agent System (UGV) Using Neural Network
The paper presented in this article deals with the issue of distributed cooperative formation of multi-agent systems (MASs). It proposes the use of appropriate neural network control methods to address formation requirements (uncertainties dynamic model). It considers an adaptive leader-follower distributed cooperative formation control based on neural networks (NNs) developed for a class of second-order nonlinear multi-agent systems and neural networks Neural networks are used to compute system data that inputs layer (position, velocity), hidden layers, and output layer. Through collaboration between leader-follower approaches and neural networks with complex systems or complex conditions receive an effective cooperative formation control method. The sufficient conditions for the system stability were derived using Lyapunov stability theory, graph theory, and state space methods. By simulation, the results of this study can be obtained from the main data of the multi-agent system in formation control and verified that the system can process consistency, stability, reliability, and accuracy in cooperative formation.
comment: 6 pages, 5 figures
☆ An Extended Kuramoto Model for Frequency and Phase Synchronization in Delay-Free Networks with Finite Number of Agents
Due to its description of a synchronization between oscillators, the Kuramoto model is an ideal choice for a synchronisation algorithm in networked systems. This requires to achieve not only a frequency synchronization but also a phase synchronization - something the standard Kuramoto model can not provide for a finite number of agents. In this case, a remaining phase difference is necessary to offset differences of the natural frequencies. Setting the Kuramoto model into the context of dynamic consensus and making use of the $n$th order discrete average consensus algorithm, this paper extends the standard Kuramoto model in such a way that frequency and phase synchronization are separated. This in turn leads to an algorithm achieve the required frequency and phase synchronization also for a finite number of agents. Simulations show the viability of this extended Kuramoto model.
comment: 8 pages, 6 figures. Shorter version submitted to the 63rd IEEE Conference on Decision and Control, 2024. Funded by BMBF through 6GEM research hub (16KISK038) and by DFG through project JCRS CoMP (504990291)
☆ Charged Momentum: Electric Vehicle Surge in India's 2023 Landscape
Electric vehicles (EVs) have emerged as a transformative force in India's transportation sector, offering a sustainable solution to the country's growing energy and environmental challenges. Against the backdrop of rapid urbanization, rising pollution levels, and the need for energy security, EVs have gained traction as a viable alternative to traditional internal combustion engine vehicles. This paper provides a comprehensive analysis of the electric vehicle market in India, focusing particularly on the landscape of 2023. It emphasizes key aspects such as the 2023 scenario of EV adoption, the role of indigenous manufacturers, dominant players shaping the market, and the influence of government policies and initiatives, including the FAME I and II schemes. Furthermore, the paper delves into EV sales data for the fiscal year 2023, offering insights into market trends and consumer preferences. By elucidating the current state of EVs in India, this paper aims to contribute to a deeper understanding of the country's transition towards sustainable mobility and its implications for energy, environment, and economy.
☆ Quantifying the Aggregate Flexibility of EV Charging Stations for Dependable Congestion Management Products: A Dutch Case Study
Electric vehicles (EVs) play a crucial role in the transition towards sustainable modes of transportation and thus are critical to the energy transition. As their number grows, managing the aggregate power of EV charging is crucial to maintain grid stability and mitigate congestion. This study analyses more than 500 thousand real charging transactions in the Netherlands to explore the challenge and opportunity for the energy system presented by EV growth and smart charging flexibility. Specifically, it analyses the collective ability to provide congestion management services according to the specifications of those services in the Netherlands. In this study, a data-driven model of charging behaviour is created to explore the implications of delivering dependable congestion management services at various aggregation levels and types of service. The probability of offering specific grid services by different categories of charging stations (CS) is analysed. These probabilities can help EV aggregators, such as charging point operators, make informed decisions about offering congestion mitigation products per relevant regulations and distribution system operators to assess their potential. The ability to offer different flexibility products, namely re-dispatch and capacity limitation, for congestion management, is assessed using various dispatch strategies. Next, machine learning models are used to predict the probability of CSs being able to deliver these products, accounting for uncertainties. Results indicate that residential charging locations have significant potential to provide both products during evening peak hours. While shared EVs offer better certainty regarding arrival and departure times, their small fleet size currently restricts their ability to meet the minimum order size of flexible products.
comment: 18 Pages,13 figures, 2 tables
☆ A Control-Recoverable Added-Noise-based Privacy Scheme for LQ Control in Networked Control Systems
As networked control systems continue to evolve, ensuring the privacy of sensitive data becomes an increasingly pressing concern, especially in situations where the controller is physically separated from the plant. In this paper, we propose a secure control scheme for computing linear quadratic control in a networked control system utilizing two networked controllers, a privacy encoder and a control restorer. Specifically, the encoder generates two state signals blurred with random noise and sends them to the controllers, while the restorer reconstructs the correct control signal. The proposed design effectively preserves the privacy of the control system's state without sacrificing the control performance. We theoretically quantify the privacy-preserving performance in terms of the state estimation error of the controllers and the disclosure probability. Additionally, the proposed privacy-preserving scheme is also proven to satisfy differential privacy. Moreover, we extend the proposed privacy-preserving scheme and evaluation method to cases where collusion between two controllers occurs. Finally, we verify the validity of our proposed scheme through simulations.
☆ A Log-domain Interior Point Method for Convex Quadratic Games
In this paper, we propose an equilibrium-seeking algorithm for finding generalized Nash equilibria of non-cooperative monotone convex quadratic games. Specifically, we recast the Nash equilibrium-seeking problem as variational inequality problem that we solve using a log-domain interior point method and provide a general purpose solver based on this algorithm. This approach is suitable for non-potential, general sum games and does not require extensive structural assumptions. We demonstrate the efficiency and versatility of our method using three benchmark games and demonstrate our algorithm is especially effective on small to medium scale problems.
☆ Observer-Based Environment Robust Control Barrier Functions for Safety-critical Control with Dynamic Obstacles
This paper proposes a safety-critical controller for dynamic and uncertain environments, leveraging a robust environment control barrier function (ECBF) to enhance the robustness against the measurement and prediction uncertainties associated with moving obstacles. The approach reduces conservatism, compared with a worst-case uncertainty approach, by incorporating a state observer for obstacles into the ECBF design. The controller, which guarantees safety, is achieved through solving a quadratic programming problem. The proposed method's effectiveness is demonstrated via a dynamic obstacle-avoidance problem for an autonomous vehicle, including comparisons with established baseline approaches.
☆ Network-Aware Value Stacking of Community Battery via Asynchronous Distributed Optimization
Community battery systems have been widely deployed to provide services to the grid. Unlike a single battery storage system in the community, coordinating multiple community batteries can further unlock their value, enhancing the viability of community battery solutions. However, the centralized control of community batteries relies on the full information of the system, which is less practical and may even lead to privacy leakage. In this paper, we formulate a value-stacking optimization problem for community batteries to interact with local solar, buildings, and the grid, within distribution network constraints. We then propose a distributed algorithm using asynchronous distributed alternate direction method of multipliers (ADMM) to solve the problem. Our algorithm is robust to communication latency between community batteries and the grid while preserving the operational privacy. The simulation results demonstrate the convergence of our proposed asynchronous distributed ADMM algorithm. We also evaluate the electricity cost and the contribution of each value stream in the value-stacking problem for community batteries using real-world data.
comment: 2024 IEEE Power & Energy Society General Meeting (PESGM)
☆ Federated reinforcement learning for robot motion planning with zero-shot generalization
This paper considers the problem of learning a control policy for robot motion planning with zero-shot generalization, i.e., no data collection and policy adaptation is needed when the learned policy is deployed in new environments. We develop a federated reinforcement learning framework that enables collaborative learning of multiple learners and a central server, i.e., the Cloud, without sharing their raw data. In each iteration, each learner uploads its local control policy and the corresponding estimated normalized arrival time to the Cloud, which then computes the global optimum among the learners and broadcasts the optimal policy to the learners. Each learner then selects between its local control policy and that from the Cloud for next iteration. The proposed framework leverages on the derived zero-shot generalization guarantees on arrival time and safety. Theoretical guarantees on almost-sure convergence, almost consensus, Pareto improvement and optimality gap are also provided. Monte Carlo simulation is conducted to evaluate the proposed framework.
☆ Safety-Aware Reinforcement Learning for Electric Vehicle Charging Station Management in Distribution Network
The increasing integration of electric vehicles (EVs) into the grid can pose a significant risk to the distribution system operation in the absence of coordination. In response to the need for effective coordination of EVs within the distribution network, this paper presents a safety-aware reinforcement learning (RL) algorithm designed to manage EV charging stations while ensuring the satisfaction of system constraints. Unlike existing methods, our proposed algorithm does not rely on explicit penalties for constraint violations, eliminating the need for penalty coefficient tuning. Furthermore, managing EV charging stations is further complicated by multiple uncertainties, notably the variability in solar energy generation and energy prices. To address this challenge, we develop an off-policy RL algorithm to efficiently utilize data to learn patterns in such uncertain environments. Our algorithm also incorporates a maximum entropy framework to enhance the RL algorithm's exploratory process, preventing convergence to local optimal solutions. Simulation results demonstrate that our algorithm outperforms traditional RL algorithms in managing EV charging in the distribution network.
comment: 2024 IEEE Power & Energy Society General Meeting (PESGM)
☆ Performance-Guaranteed Solutions for Multi-Agent Optimal Coverage Problems using Submodularity, Curvature, and Greedy Algorithms
We consider a class of multi-agent optimal coverage problems where the goal is to determine the optimal placement of a group of agents in a given mission space such that they maximize a joint ``coverage'' objective. This class of problems is extremely challenging due to the non-convex nature of the mission space and of the coverage objective. With this motivation, we propose to use a greedy algorithm as a means of getting feasible coverage solutions efficiently. Even though such greedy solutions are suboptimal, the submodularity (diminishing returns) property of the coverage objective can be exploited to provide performance bound guarantees - not only for the greedy solutions but also for any subsequently improved solutions. Moreover, we show that improved performance bound guarantees (beyond the standard (1-1/e) performance bound) can be established using various curvature measures that further characterize the considered coverage problem. In particular, we provide a brief review of all existing popular curvature measures found in the submodular maximization literature, including a recent curvature measure that we proposed, and discuss in detail their applicability, practicality, and effectiveness in the context of optimal coverage problems. Moreover, we characterize the dependence of the effectiveness of different curvature measures (in providing improved performance bound guarantees) on the agent sensing capabilities. Finally, we provide several numerical results to support our findings and propose several potential future research directions.
comment: Will be submitted to CDC 2024
☆ A Unified Toll Lane Framework for Autonomous and High-Occupancy Vehicles in Interactive Mixed Autonomy
In this study, we introduce a toll lane framework that optimizes the mixed flow of autonomous and high-occupancy vehicles on freeways, where human-driven and autonomous vehicles of varying commuter occupancy share a segment. Autonomous vehicles, with their ability to maintain shorter headways, boost traffic throughput. Our framework designates a toll lane for autonomous vehicles with high occupancy to use free of charge, while others pay a toll. We explore the lane choice equilibria when all vehicles minimize travel costs, and characterize the equilibria by ranking vehicles by their mobility enhancement potential, a concept we term the mobility degree. Through numerical examples, we demonstrate the framework's utility in addressing design challenges such as setting optimal tolls, determining occupancy thresholds, and designing lane policies, showing how it facilitates the integration of high-occupancy and autonomous vehicles. We also propose an algorithm for assigning rational tolls to decrease total commuter delay and examine the effects of toll non-compliance. Our findings suggest that self-interest-driven behavior mitigates moderate non-compliance impacts, highlighting the framework's resilience. This work presents a pioneering comprehensive analysis of a toll lane framework that emphasizes the coexistence of autonomous and high-occupancy vehicles, offering insights for traffic management improvements and the integration of autonomous vehicles into existing transportation infrastructures.
☆ When are Lossy Energy Storage Optimization Models Convex?
We consider a class of optimization problems involving the optimal operation of a single lossy energy storage system that incurs energy loss when charging or discharging. Such inefficiencies in the energy storage dynamics are known to result in a nonconvex set of feasible charging and discharging power profiles. In this letter, we provide an equivalent reformulation for this class of optimization problems, along with sufficient conditions for the convexity of the proposed reformulation. The conditions provided generalize existing conditions for convexity in the literature.
comment: 5 pages, 1 figure
☆ Sensory Glove-Based Surgical Robot User Interface IROS
Robotic surgery has reached a high level of maturity and has become an integral part of standard surgical care. However, existing surgeon consoles are bulky and take up valuable space in the operating room, present challenges for surgical team coordination, and their proprietary nature makes it difficult to take advantage of recent technological advances, especially in virtual and augmented reality. One potential area for further improvement is the integration of modern sensory gloves into robotic platforms, allowing surgeons to control robotic arms directly with their hand movements intuitively. We propose one such system that combines an HTC Vive tracker, a Manus Meta Prime 3 XR sensory glove, and God Vision wireless smart glasses. The system controls one arm of a da Vinci surgical robot. In addition to moving the arm, the surgeon can use fingers to control the end-effector of the surgical instrument. Hand gestures are used to implement clutching and similar functions. In particular, we introduce clutching of the instrument orientation, a functionality not available in the da Vinci system. The vibrotactile elements of the glove are used to provide feedback to the user when gesture commands are invoked. A preliminary evaluation of the system shows that it has excellent tracking accuracy and allows surgeons to efficiently perform common surgical training tasks with minimal practice with the new interface; this suggests that the interface is highly intuitive. The proposed system is inexpensive, allows rapid prototyping, and opens opportunities for further innovations in the design of surgical robot interfaces.
comment: 6 pages, 5 figures, 7 tables, submitted to International Conference on Intelligent Robots and Systems (IROS)2024
☆ Safety-Aware Perception for Autonomous Collision Avoidance in Dynamic Environments
Autonomous collision avoidance requires accurate environmental perception; however, flight systems often possess limited sensing capabilities with field-of-view (FOV) restrictions. To navigate this challenge, we present a safety-aware approach for online determination of the optimal sensor-pointing direction $\psi_\text{d}$ which utilizes control barrier functions (CBFs). First, we generate a spatial density function $\Phi$ which leverages CBF constraints to map the collision risk of all local coordinates. Then, we convolve $\Phi$ with an attitude-dependent sensor FOV quality function to produce the objective function $\Gamma$ which quantifies the total observed risk for a given pointing direction. Finally, by finding the global optimizer for $\Gamma$, we identify the value of $\psi_\text{d}$ which maximizes the perception of risk within the FOV. We incorporate $\psi_\text{d}$ into a safety-critical flight architecture and conduct a numerical analysis using multiple simulated mission profiles. Our algorithm achieves a success rate of $88-96\%$, constituting a $16-29\%$ improvement compared to the best heuristic methods. We demonstrate the functionality of our approach via a flight demonstration using the Crazyflie 2.1 micro-quadrotor. Without a priori obstacle knowledge, the quadrotor follows a dynamic flight path while simultaneously calculating and tracking $\psi_\text{d}$ to perceive and avoid two static obstacles with an average computation time of 371 $\mu$s.
☆ Credit vs. Discount-Based Congestion Pricing: A Comparison Study
Tolling, or congestion pricing, offers a promising traffic management policy for regulating congestion, but has also attracted criticism for placing outsized financial burdens on low-income users. Credit-based congestion pricing (CBCP) and discount-based congestion pricing (DBCP) policies, which respectively provide travel credits and toll discounts to low-income users on tolled roads, have emerged as promising mechanisms for reducing traffic congestion without worsening societal inequities. However, the optimal design of CBCP and DBCP policies, as well as their relative advantages and disadvantages, remain poorly understood. To address this, we study the effects of implementing CBCP and DBCP policies to route users on a network of multi-lane highways with tolled express lanes. We formulate a non-atomic routing game framework in which a subset of eligible users is granted toll relief in the form of a fixed budget or toll discount, while the remaining ineligible users must pay out-of-pocket. We prove the existence of Nash equilibrium traffic flow patterns corresponding to any given CBCP or DBCP policy. Under the additional assumption that eligible users have time-invariant VoTs, we provide a convex program to efficiently compute these equilibria. For networks consisting of a single edge, we identify conditions under which CBCP policies outperform DBCP policies (and vice versa), in the sense of improving eligible users' access to the express lane. Finally, we present empirical results from a CBCP pilot study of the San Mateo 101 Express Lane Project in California. Our empirical results corroborate our theoretical analysis of the impact of deploying credit-based and discount-based policies, and lend insights into the sensitivity of their impact with respect to the travel demand and users' VoTs.
☆ Sequential Modeling of Complex Marine Navigation: Case Study on a Passenger Vessel (Student Abstract) AAAI 2024
The maritime industry's continuous commitment to sustainability has led to a dedicated exploration of methods to reduce vessel fuel consumption. This paper undertakes this challenge through a machine learning approach, leveraging a real-world dataset spanning two years of a ferry in west coast Canada. Our focus centers on the creation of a time series forecasting model given the dynamic and static states, actions, and disturbances. This model is designed to predict dynamic states based on the actions provided, subsequently serving as an evaluative tool to assess the proficiency of the ferry's operation under the captain's guidance. Additionally, it lays the foundation for future optimization algorithms, providing valuable feedback on decision-making processes. To facilitate future studies, our code is available at \url{https://github.com/pagand/model_optimze_vessel/tree/AAAI}
comment: 5 pages, 3 figures, AAAI 2024 student abstract
☆ Clustering Heuristics for Robust Energy Capacitated Vehicle Routing Problem (ECVRP)
The paper presents an approach to solving the Robust Energy Capacitated Vehicle Routing Problem (RECVRP), focusing on electric vehicles and their limited battery capacity. A finite number of customers, each with their own demand, have to be serviced by an electric vehicle fleet while ensuring that none of the vehicles run out of energy. The time and energy it takes to travel between any two points is modeled as a random variable with known distribution. We propose a Mixed Integer Program (MIP) for computing an exact solution and introduce clustering heuristics to enhance the solution speed. This enables efficient re-planning of routes in dynamic scenarios. The methodology transforms the RECVRP into smaller problems, yielding good quality solutions quickly compared to existing methods. We demonstrate the effectiveness of this approach using a well-known benchmark problem set as well as a set of randomly generated problems.
♻ ☆ Learning Algorithms for Verification of Markov Decision Processes
We present a general framework for applying learning algorithms and heuristical guidance to the verification of Markov decision processes (MDPs). The primary goal of our techniques is to improve performance by avoiding an exhaustive exploration of the state space, instead focussing on particularly relevant areas of the system, guided by heuristics. Our work builds on the previous results of Br{\'{a}}zdil et al., significantly extending it as well as refining several details and fixing errors. The presented framework focuses on probabilistic reachability, which is a core problem in verification, and is instantiated in two distinct scenarios. The first assumes that full knowledge of the MDP is available, in particular precise transition probabilities. It performs a heuristic-driven partial exploration of the model, yielding precise lower and upper bounds on the required probability. The second tackles the case where we may only sample the MDP without knowing the exact transition dynamics. Here, we obtain probabilistic guarantees, again in terms of both the lower and upper bounds, which provides efficient stopping criteria for the approximation. In particular, the latter is an extension of statistical model-checking (SMC) for unbounded properties in MDPs. In contrast to other related approaches, we do not restrict our attention to time-bounded (finite-horizon) or discounted properties, nor assume any particular structural properties of the MDP.
♻ ☆ Secure Synchronization of Heterogeneous Pulse-Coupled Oscillators
In this paper, we consider the synchronization of heterogeneous pulse-coupled oscillators (PCOs), where some of the oscillators might be faulty or malicious. The oscillators interact through identical pulses at discrete instants and evolve continuously with different frequencies otherwise. Despite the presence of misbehaviors, benign oscillators aim to reach synchronization. To achieve this objective, two resilient synchronization protocols are developed in this paper by adapting the real-valued mean-subsequence reduced (MSR) algorithm to pulse-based interactions. The first protocol relies on packet-based communication to transmit absolute frequencies, while the second protocol operates purely with pulses to calculate relative frequencies. In both protocols, each normal oscillator periodically counts the received pulses to detect possible malicious behaviors. By disregarding suspicious pulses from its neighbors, the oscillator updates both its phases and frequencies. The paper establishes sufficient conditions on the initial states and graph structure under which resilient synchronization is achieved in the PCO network. Specifically, the normal oscillators can either detect the presence of malicious nodes or synchronize in both phases and frequencies. Additionally, a comparison between the two algorithms reveals a trade-off between relaxed initial conditions and reduced communication burden.
♻ ☆ Guaranteeing Control Requirements via Reward Shaping in Reinforcement Learning
In addressing control problems such as regulation and tracking through reinforcement learning, it is often required to guarantee that the acquired policy meets essential performance and stability criteria such as a desired settling time and steady-state error prior to deployment. Motivated by this necessity, we present a set of results and a systematic reward shaping procedure that (i) ensures the optimal policy generates trajectories that align with specified control requirements and (ii) allows to assess whether any given policy satisfies them. We validate our approach through comprehensive numerical experiments conducted in two representative environments from OpenAI Gym: the Inverted Pendulum swing-up problem and the Lunar Lander. Utilizing both tabular and deep reinforcement learning methods, our experiments consistently affirm the efficacy of our proposed framework, highlighting its effectiveness in ensuring policy adherence to the prescribed control requirements.
♻ ☆ Unbiased Parameter Estimation via DREM with Annihilators
In the adaptive control theory, the dynamic regressor extension and mixing (DREM) procedure has become widespread as it allows one to describe a variety of adaptive control problems in unified terms of the parameter estimation problem of a regression equation with a scalar regressor. However, when the system/parameterization is affected by perturbations, the estimation laws, which are designed on the basis of such equation, asymptotically provides only biased estimates. In this paper, based on the bias-eliminated least-squares (BELS) approach, a modification of the DREM procedure is proposed that allows one to annihilate perturbations asymptotically and, consequently, asymptotically obtain unbiased estimates. The theoretical results are supported with mathematical modelling and can be used to design adaptive observers and control systems.
comment: 7 pages, 2 figures
♻ ☆ On Continuous Full-Order Integral-Terminal Sliding Mode Control with Unknown Apriori Bound on Uncertainty
This study aims at providing a solution to the problem of designing a continuous and finite-time control for a class of nonlinear systems in the presence of matched uncertainty with an unknown apriori bound. First, we propose a Full-Order Integral-Terminal Sliding Manifold (FOITSM) with a conventional (discontinuous) sliding mode to show that it provides the combined attributes of the nonsingular terminal and integral sliding mode algorithms. Secondly, an Adaptive Disturbance Observer (ADO) has been designed to alleviate the effect of the uncertainty acting on the system. On application of the ADO-based Full-Order Integral-Terminal Sliding Mode Control (FOITSMC), the chattering phenomenon in control input has been reduced substantially in the presence of conditionally known matched disturbances. Moreover, the adaptive gains of ADO are updated non-monotonically without over-bounding the acting disturbance, yet sustain the global boundedness of state trajectories within a specific bound. %Finally, an application of the proposed algorithm for attitude stabilization of a rigid spacecraft has been successively shown.
comment: 14 pages, 9 figures
♻ ☆ Towards Perturbation-Induced Static Pivoting on GPU-Based Linear Solvers
Linear system solving is a key tool for computational power system studies, e.g., optimal power flow, transmission switching, or unit commitment. CPU-based linear system solver speeds, however, have saturated in recent years. Emerging research shows that GPU-based linear system solvers are beginning to achieve notable speedup over CPU-based alternatives in some applications. Due to the architecture of GPU memory access, numerical pivoting represents the new bottleneck which prevents GPU-based solvers from running even faster. Accordingly, this paper proposes a matrix perturbation-based method to induce static pivoting. Using this approach, a series of perturbed, well-conditioned, pivot-free linear systems are solved in parallel on GPUs. Matrix expansion routines are then used to linearly combine the results, and the true solution is recovered to an arbitrarily high degree of theoretical accuracy. We showcase the validity of our approach on distributed-slack AC power flow solve iterations associated with the PGLib 300-bus test case.
comment: submitted to PESGM 2024
♻ ☆ Design of Efficient Point-Mass Filter with Application in Terrain Aided Navigation
This paper deals with state estimation of stochastic models with linear state dynamics, continuous or discrete in time. The emphasis is laid on a numerical solution to the state prediction by the time-update step of the grid-point-based point-mass filter (PMF), which is the most computationally demanding part of the PMF algorithm. A novel efficient PMF (ePMF) estimator, unifying continuous and discrete, approaches is proposed, designed, and discussed. By numerical illustrations, it is shown, that the proposed ePMF can lead to a time complexity reduction that exceeds 99.9% without compromising accuracy. The MATLAB code of the ePMF is released with this paper.
comment: Pulibshed in proccedings of FUSION 2023. PLEASE cite the published version! The code is also now available at GitHub: https://github.com/IDM-UWB/efficient-PMF
♻ ☆ RISAR: RIS-assisted Human Activity Recognition with Commercial Wi-Fi Devices
Human activity recognition (HAR) holds significant importance in smart homes, security, and healthcare. Existing systems face limitations because of the insufficient spatial diversity provided by a limited number of antennas. Furthermore, inefficiencies in noise reduction and feature extraction from sensing data pose challenges to recognition performance. This study presents a reconfigurable intelligent surface (RIS)-assisted passive human activity recognition (RISAR) method, compatible with commercial Wi-Fi devices. RISAR leverages a RIS to enhance the spatial diversity of Wi-Fi signals, effectively capturing a wider range of information distributed across the spatial domain. A novel high-dimensional factor model based on random matrix theory is proposed to address noise reduction and feature extraction in the temporal domain. A dual-stream spatial-temporal attention network model is developed to assign variable weights to different characteristics and sequences, mimicking human cognitive processes in prioritizing essential information. Experimental analysis shows that RISAR significantly outperforms existing HAR methods in accuracy and efficiency, achieving an average accuracy of 97.26%. These findings underscore RISAR's adaptability and potential as a robust activity recognition solution in real environments.
♻ ☆ Vehicle Dispatching and Routing of On-Demand Intercity Ride-Pooling Services: A Multi-Agent Hierarchical Reinforcement Learning Approach
The integrated development of city clusters has given rise to an increasing demand for intercity travel. Intercity ride-pooling service exhibits considerable potential in upgrading traditional intercity bus services by implementing demand-responsive enhancements. Nevertheless, its online operations suffer the inherent complexities due to the coupling of vehicle resource allocation among cities and pooled-ride vehicle routing. To tackle these challenges, this study proposes a two-level framework designed to facilitate online fleet management. Specifically, a novel multi-agent feudal reinforcement learning model is proposed at the upper level of the framework to cooperatively assign idle vehicles to different intercity lines, while the lower level updates the routes of vehicles using an adaptive large neighborhood search heuristic. Numerical studies based on the realistic dataset of Xiamen and its surrounding cities in China show that the proposed framework effectively mitigates the supply and demand imbalances, and achieves significant improvement in both the average daily system profit and order fulfillment ratio.
♻ ☆ Contingency Analyses with Warm Starter using Probabilistic Graphical Model
Cyberthreats are an increasingly common risk to the power grid and can thwart secure grid operations. We propose to extend contingency analysis to include cyberthreat evaluations. However, unlike the traditional N-1 or N-2 contingencies, cyberthreats (e.g., MadIoT) require simulating hard-to-solve N-k (with k >> 2) contingencies in a practical amount of time. Purely physics-based power flow solvers, while being accurate, are slow and may not solve N-k contingencies in a timely manner, whereas the emerging data-driven alternatives are fast but not sufficiently generalizable, interpretable, and scalable. To address these challenges, we propose a novel conditional Gaussian Random Field-based data-driven method that performs fast and accurate evaluation of cyberthreats. It achieves speedup of contingency analysis by warm-starting simulations, i.e., improving starting points, for the physical solvers. To improve the physical interpretability and generalizability, the proposed method incorporates domain knowledge by considering the graphical nature of the grid topology. To improve scalability, the method applies physics-informed regularization that reduces model complexity. Experiments validate that simulating MadIoT-induced attacks with our warm starter becomes approximately 5x faster on a realistic 2000-bus system.
comment: arXiv admin note: substantial text overlap with arXiv:2205.03673
♻ ☆ Tropical modeling of battery swapping and charging station
We propose and investigate a queueing model of a battery swapping and charging station (BSCS) for electric vehicles (EVs). A new approach to the analysis of the queueing model is developed, which combines the representation of the model as a stochastic dynamic system with the use of the methods and results of tropical algebra, which deals with the theory and applications of algebraic systems with idempotent operations. We describe the dynamics of the queueing model by a system of recurrence equations that involve random variables (RVs) to represent the interarrival time of incoming EVs. A performance measure for the model is defined as the mean operation cycle time of the station. Furthermore, the system of equations is represented in terms of the tropical algebra in vector form as an implicit linear state dynamic equation. The performance measure takes on the meaning of the mean growth rate of the state vector (the Lyapunov exponent) of the dynamic system. By applying a solution technique of vector equations in tropical algebra, the implicit equation is transformed into an explicit one with a state transition matrix with random entries. The evaluation of the Lyapunov exponent reduces to finding the limit of the expected value of norms of tropical matrix products. This limit is then obtained using results from the tropical spectral theory of deterministic and random matrices. With this approach, we derive a new exact formula for the mean cycle time of the BSCS, which is given in terms of the expected value of the RVs involved. We present the results of the Monte Carlo simulation of the BSCS's operation, which show a good agreement with the exact solution. The application of the obtained solution to evaluate the performance of one BSCS and to find the optimal distribution of battery packs between stations in a network of BSCSs is discussed.
comment: 21 pages, 5 figures
♻ ☆ Application of tropical optimization for solving multicriteria problems of pairwise comparisons using log-Chebyshev approximation
We consider a decision-making problem to find absolute ratings of alternatives that are compared in pairs under multiple criteria, subject to constraints in the form of two-sided bounds on ratios between the ratings. Given matrices of pairwise comparisons made according to the criteria, the problem is formulated as the log-Chebyshev approximation of these matrices by a common consistent matrix (a symmetrically reciprocal matrix of unit rank) to minimize the approximation errors for all matrices simultaneously. We rearrange the approximation problem as a constrained multiobjective optimization problem of finding a vector that determines the approximating consistent matrix. The problem is then represented in the framework of tropical algebra, which deals with the theory and applications of idempotent semirings and provides a formal basis for fuzzy and interval arithmetic. We apply methods and results of tropical optimization to develop a new approach for handling the multiobjective optimization problem according to various principles of optimality. New complete solutions in the sense of the max-ordering, lexicographic ordering and lexicographic max-ordering optimality are obtained, which are given in a compact vector form ready for formal analysis and efficient computation. We present numerical examples of solving multicriteria problems of rating four alternatives from pairwise comparisons to illustrate the technique and compare it with others.
comment: 44 pages
Databases
Overview of Publicly Available Degradation Data Sets for Tasks within Prognostics and Health Management
Central to the efficacy of prognostics and health management methods is the acquisition and analysis of degradation data, which encapsulates the evolving health condition of engineering systems over time. Degradation data serves as a rich source of information, offering invaluable insights into the underlying degradation processes, failure modes, and performance trends of engineering systems. This paper provides an overview of publicly available degradation data sets.
☆ CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows
Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the moment of writing, virtually all stream processors that guarantee exactly-once processing implement a variant of Apache Flink's coordinated checkpoints - an extension of the original Chandy-Lamport checkpoints from 1985. However, the reasons behind this prevalence of the coordinated approach remain anecdotal, as reported by practitioners of the stream processing community. At the same time, common checkpointing approaches, such as the uncoordinated and the communication-induced ones, remain largely unexplored. This paper is the first to address this gap by i) shedding light on why practitioners have favored the coordinated approach and ii) by investigating whether there are viable alternatives. To this end, we implement three checkpointing approaches that we surveyed and adapted for the distinct needs of streaming dataflows. Our analysis shows that the coordinated approach outperforms the uncoordinated and communication-induced protocols under uniformly distributed workloads. To our surprise, however, the uncoordinated approach is not only competitive to the coordinated one in uniformly distributed workloads, but it also outperforms the coordinated approach in skewed workloads. We conclude that rather than blindly employing coordinated checkpointing, research should focus on optimizing the very promising uncoordinated approach, as it can address issues with skew and support prevalent cyclic queries. We believe that our findings can trigger further research into checkpointing mechanisms.
☆ No more optimization rules: LLM-enabled policy-based multi-modal query optimizer (version 1)
Large language model (LLM) has marked a pivotal moment in the field of machine learning and deep learning. Recently its capability for query planning has been investigated, including both single-modal and multi-modal queries. However, there is no work on the query optimization capability of LLM. As a critical (or could even be the most important) step that significantly impacts the execution performance of the query plan, such analysis and attempts should not be missed. From another aspect, existing query optimizers are usually rule-based or rule-based + cost-based, i.e., they are dependent on manually created rules to complete the query plan rewrite/transformation. Given the fact that modern optimizers include hundreds to thousands of rules, designing a multi-modal query optimizer following a similar way is significantly time-consuming since we will have to enumerate as many multi-modal optimization rules as possible, which has not been well addressed today. In this paper, we investigate the query optimization ability of LLM and use LLM to design LaPuda, a novel LLM and Policy based multi-modal query optimizer. Instead of enumerating specific and detailed rules, LaPuda only needs a few abstract policies to guide LLM in the optimization, by which much time and human effort are saved. Furthermore, to prevent LLM from making mistakes or negative optimization, we borrow the idea of gradient descent and propose a guided cost descent (GCD) algorithm to perform the optimization, such that the optimization can be kept in the correct direction. In our evaluation, our methods consistently outperform the baselines in most cases. For example, the optimized plans generated by our methods result in 1~3x higher execution speed than those by the baselines.
comment: Yifan and Haodi contribute equally to the work
☆ Secure Query Processing with Linear Complexity
We present LINQ, the first join protocol with linear complexity (in both running time and communication) under the secure multi-party computation model (MPC). It can also be extended to support all free-connex queries, a large class of select-join-aggregate queries, still with linear complexity. This matches the plaintext result for the query processing problem, as free-connex queries are the largest class of queries known to be solvable in linear time in plaintext. We have then built a query processing system based on LINQ, and the experimental results show that LINQ significantly outperforms the state of the art. For example, it can finish a query on three relations with an output size of 1 million tuples in around 100s in the LAN setting, while existing protocols that support the query cannot finish in an hour. Thus LINQ brings MPC query processing closer to practicality.
☆ Distance Comparison Operators for Approximate Nearest Neighbor Search: Exploration and Benchmark
Approximate nearest neighbor search (ANNS) on high-dimensional vectors has become a fundamental and essential component in various machine learning tasks. Prior research has shown that the distance comparison operation is the bottleneck of ANNS, which determines the query and indexing performance. To overcome this challenge, some novel methods have been proposed recently. The basic idea is to estimate the actual distance with fewer calculations, at the cost of accuracy loss. Inspired by this, we also propose that some classical techniques and deep learning models can also be adapted to this purpose. In this paper, we systematically categorize the techniques that have been or can be used to accelerate distance approximation. And to help the users understand the pros and cons of different techniques, we design a fair and comprehensive benchmark, Fudist implements these techniques with the same base index and evaluates them on 16 real datasets with several evaluation metrics. Designed as an independent and portable library, Fudist is orthogonal to the specific index structure and thus can be easily utilized in the current ANNS library to achieve significant improvements.
☆ A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs
Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing in graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses in attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m- dimensional random walk that accounts for the paths specified in a hypothesis. We further optimize its time efficiency and propose PHASEopt. Experiments on real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling in terms of accuracy and time efficiency.
☆ Unifews: Unified Entry-Wise Sparsification for Efficient Graph Neural Network
Graph Neural Networks (GNNs) have shown promising performance in various graph learning tasks, but at the cost of resource-intensive computations. The primary overhead of GNN update stems from graph propagation and weight transformation, both involving operations on graph-scale matrices. Previous studies attempt to reduce the computational budget by leveraging graph-level or network-level sparsification techniques, resulting in downsized graph or weights. In this work, we propose Unifews, which unifies the two operations in an entry-wise manner considering individual matrix elements, and conducts joint edge-weight sparsification to enhance learning efficiency. The entry-wise design of Unifews enables adaptive compression across GNN layers with progressively increased sparsity, and is applicable to a variety of architectural designs with on-the-fly operation simplification. Theoretically, we establish a novel framework to characterize sparsified GNN learning in view of a graph optimization process, and prove that Unifews effectively approximates the learning objective with bounded error and reduced computational load. We conduct extensive experiments to evaluate the performance of our method in diverse settings. Unifews is advantageous in jointly removing more than 90% of edges and weight entries with comparable or better accuracy than baseline models. The sparsification offers remarkable efficiency improvements including 10-20x matrix operation reduction and up to 100x acceleration in graph propagation time for the largest graph at the billion-edge scale.
☆ Database Dependencies and Formal Concept Analysis
This is an account of the characterization of database dependencies with Formal Concept Analysis.
☆ DiffImpute: Tabular Data Imputation With Denoising Diffusion Probabilistic Model
Tabular data plays a crucial role in various domains but often suffers from missing values, thereby curtailing its potential utility. Traditional imputation techniques frequently yield suboptimal results and impose substantial computational burdens, leading to inaccuracies in subsequent modeling tasks. To address these challenges, we propose DiffImpute, a novel Denoising Diffusion Probabilistic Model (DDPM). Specifically, DiffImpute is trained on complete tabular datasets, ensuring that it can produce credible imputations for missing entries without undermining the authenticity of the existing data. Innovatively, it can be applied to various settings of Missing Completely At Random (MCAR) and Missing At Random (MAR). To effectively handle the tabular features in DDPM, we tailor four tabular denoising networks, spanning MLP, ResNet, Transformer, and U-Net. We also propose Harmonization to enhance coherence between observed and imputed data by infusing the data back and denoising them multiple times during the sampling stage. To enable efficient inference while maintaining imputation performance, we propose a refined non-Markovian sampling process that works along with Harmonization. Empirical evaluations on seven diverse datasets underscore the prowess of DiffImpute. Specifically, when paired with the Transformer as the denoising network, it consistently outperforms its competitors, boasting an average ranking of 1.7 and the most minimal standard deviation. In contrast, the next best method lags with a ranking of 2.8 and a standard deviation of 0.9. The code is available at https://github.com/Dendiiiii/DiffImpute.
comment: 26 pages, 6 figures
☆ On Enforcing Existence and Non-Existence Constraints in MatBase
Existence constraints were defined in the Relational Data Model, but, unfortunately, are not provided by any Relational Database Management System, except for their NOT NULL particular case. Our (Elementary) Mathematical Data Model extended them to function products and introduced their dual non-existence constraints. MatBase, an intelligent data and knowledge base management system prototype based on both these data models, not only provides existence and non-existence constraints, but also automatically generates code for their enforcement. This paper presents and discusses the algorithms used by MatBase to enforce these types of constraints.
comment: Submitted to the BOHR International Journal of Computer Science (BIJCS), ISSN: 2583-455X, on March 20, 2024
♻ ☆ On enforcing dyadic-type homogeneous binary function product constraints in MatBase
Homogeneous binary function products are often encountered in the sub-universes modeled by databases, from genealogical trees to sports, from education to healthcare, etc. Their properties must be discovered and enforced by the software applications managing such data to guarantee plausibility. The (Elementary) Mathematical Data Model provides 18 dyadic-type homogeneous binary function product constraint types. MatBase, an intelligent data and knowledge base management system prototype, allows database designers to simply declare them by only clicking corresponding checkboxes and automatically generates code for enforcing them. This paper describes the algorithms that MatBase uses for enforcing all these 18 homogeneous binary function product constraint types, which may also be used by developers not having access to MatBase.
comment: submitted on Dec. 7, 2023, to the Journal of Data Science and Intelligent Systems (JDSIS), on Dec. 20, 2023, to the Journal of Computational and Cognitive Engineering, both of Bon View Publishing, Singapore, on Dec. 30 to the Journal of Current Research and Studies, and on Jan. 25, 2024, to the Journal of Computer Science Research, Bilingual Publishing Group, Singapore
♻ ☆ WaZI: A Learned and Workload-aware Z-Index EDBT 2024
Learned indexes fit machine learning (ML) models to the data and use them to make query operations more time and space-efficient. Recent works propose using learned spatial indexes to improve spatial query performance by optimizing the storage layout or internal search structures according to the data distribution. However, only a few learned indexes exploit the query workload distribution to enhance their performance. In addition, building and updating learned spatial indexes are often costly on large datasets due to the inefficiency of (re)training ML models. In this paper, we present WaZI, a learned and workload-aware variant of the Z-index, which jointly optimizes the storage layout and search structures, as a viable solution for the above challenges of spatial indexing. Specifically, we first formulate a cost function to measure the performance of a Z-index on a dataset for a range-query workload. Then, we optimize the Z-index structure by minimizing the cost function through adaptive partitioning and ordering for index construction. Moreover, we design a novel page-skipping mechanism to improve the query performance of WaZI by reducing access to irrelevant data pages. Our extensive experiments show that the WaZI index improves range query time by 40% on average over the baselines while always performing better or comparably to state-of-the-art spatial indexes. Additionally, it also maintains good point query performance. Generally, WaZI provides favorable tradeoffs among query latency, construction time, and index size.
comment: Camera-ready version accepted to EDBT 2024
Emerging Technologies
☆ A Fully Automated Platform for Evaluating ReRAM Crossbars
Resistive Random Access Memory (ReRAM) is a promising candidate for implementing Computing-in-Memory (CIM) architectures and neuromorphic circuits. ReRAM cells exhibit significant variability across different memristive devices and cycles, necessitating further improvements in the areas of devices, algorithms, and applications. To achieve this, understanding the stochastic behavior of the different ReRAM technologies is essential. The NeuroBreakoutBoard (NBB) is a versatile instrumentation platform to characterize Non-Volatile Memories (NVMs). However, the NBB itself does not provide any functionality in the form of software or a controller. In this paper, we present a control board for the NBB able to perform reliability assessments of 1T1R ReRAM crossbars. In more detail, an interface that allows a host PC to communicate with the NBB via the new control board is implemented. In a case study, we analyze the Cycle-to-Cycle (C2C) variation and read disturb TiN/Ti/HfO2/TiN cells for different read voltages to gain an understanding of their operational behavior.
♻ ☆ Lightweight, error-tolerant edge detection using memristor-enabled stochastic logics
The demand for efficient edge vision has spurred the interest in developing stochastic computing approaches for performing image processing tasks. Memristors with inherent stochasticity readily introduce probability into the computations and thus enable stochastic image processing computations. Here, we present a stochastic computing approach for edge detection, a fundamental image processing technique, facilitated with memristor-enabled stochastic logics. Specifically, we integrate the memristors with logic circuits and harness the stochasticity from the memristors to realize compact stochastic logics for stochastic number encoding and processing. The stochastic numbers, exhibiting well-regulated probabilities and correlations, can be processed to perform logic operations with statistical probabilities. This can facilitate lightweight stochastic edge detection for edge visual scenarios characterized with high-level noise errors. As a practical demonstration, we implement a hardware stochastic Roberts cross operator using the stochastic logics, and prove its exceptional edge detection performance, remarkably, with 95% less computational cost while withstanding 50% bit-flip errors. The results underscore the great potential of our stochastic edge detection approach in developing lightweight, error-tolerant edge vision hardware and systems for autonomous driving, virtual/augmented reality, medical imaging diagnosis, industrial automation, and beyond.